Abstract


This monograph presents a unified treatment of single- and multi-user problems in Shannon's information
theory where we depart from the requirement that the error probability decays asymptotically in the
blocklength. Instead, the error probabilities for various problems are bounded above by a non-vanishing constant
and the spotlight is shone on achievable coding rates as functions of the growing blocklengths. This represents
the study of asymptotic estimates with non-vanishing error probabilities.

In Part I, after reviewing the fundamentals of information theory, we discuss Strassen's seminal result for
binary hypothesis testing where the type-I error probability is non-vanishing and the rate of decay of the
type-II error probability with growing number of independent observations is characterized. In Part II, we use this
basic hypothesis testing result to develop second- and sometimes even third-order asymptotic expansions
for point-to-point communication. In Part III, we consider network information theory problems
for which the second-order asymptotics are known. These problems include some classes of channels with
random state, the multiple-encoder distributed lossless source coding (Slepian-Wolf) problem and special
cases of the Gaussian interference and multiple-access channels. Finally, we discuss avenues for further
research.
Part I


Fundamentals



Chapter 1 

Introduction 


Claude E. Shannon's epochal "A Mathematical Theory of Communication" [141] marks the dawn of the
digital age. In his seminal paper, Shannon laid the theoretical and mathematical foundations of
all communication systems today. It is not an exaggeration to say that his work has had a tremendous
impact in communications engineering and beyond, in fields as diverse as statistics, economics, biology and
cryptography, just to name a few.

It has been more than 65 years since Shannon's landmark work was published. Along with impressive
research advances in the field of information theory, numerous excellent books on various aspects of the
subject have been written. The author's favorites include Cover and Thomas [33], Gallager, Csiszár and
Körner [39], Han, Yeung, and El Gamal and Kim [49]. Is there sufficient motivation to consolidate
and present another aspect of information theory systematically? It is the author's hope that the answer is
in the affirmative.

To motivate why this is so, let us recapitulate two of Shannon's major contributions in his 1948 paper.
First, Shannon showed that to reliably compress a discrete memoryless source (DMS) $X^n = (X_1, \ldots, X_n)$
where each $X_i$ has the same distribution as a common random variable $X$, it is sufficient to use $H(X)$
bits per source symbol in the limit of large blocklengths $n$, where $H(X)$ is the Shannon entropy of the
source. By reliable, it is meant that the probability of incorrect decoding of the source sequence tends to
zero as the blocklength $n$ grows. Second, Shannon showed that it is possible to reliably transmit a message
$M \in \{1, \ldots, 2^{nR}\}$ over a discrete memoryless channel (DMC) $W$ as long as the message rate $R$ is smaller
than the capacity of the channel $C(W)$. Similarly to the source compression scenario, by reliable, one means
that the probability of incorrectly decoding $M$ tends to zero as $n$ grows.

There is, however, substantial motivation to revisit the criterion of having error probabilities vanish
asymptotically. To state Shannon's source compression result more formally, let us define $M^*(P^n, \varepsilon)$ to
be the minimum code size for which the length-$n$ DMS $P^n$ is compressible to within an error probability
$\varepsilon \in (0,1)$. Then, Theorem 3 of Shannon's paper [141], together with the strong converse for lossless source
coding [33, Ex. 3.15], states that

$$\lim_{n\to\infty} \frac{1}{n} \log M^*(P^n, \varepsilon) = H(X) \quad \text{bits per source symbol.} \tag{1.1}$$

Similarly, denote by $M^*_{\mathrm{ave}}(W^n, \varepsilon)$ the maximum code size for which it is possible to communicate over a
DMC $W^n$ such that the average error probability is no larger than $\varepsilon$. Theorem 11 of Shannon's paper [141],
together with the strong converse for channel coding [180, Thm. 2], states that

$$\lim_{n\to\infty} \frac{1}{n} \log M^*_{\mathrm{ave}}(W^n, \varepsilon) = C(W) \quad \text{bits per channel use.} \tag{1.2}$$

In many practical communication settings, one does not have the luxury of being able to design an arbitrarily
long code, so one must settle for a non-vanishing, and hence finite, error probability $\varepsilon$. In this finite blocklength
and non-vanishing error probability setting, how close can one hope to get to the asymptotic limits $H(X)$ and
$C(W)$? This is, in general, a difficult question because exact evaluations of $\log M^*(P^n, \varepsilon)$ and
$\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ are intractable, apart from a few special sources and channels.

In the early years of information theory, Dobrushin [45], Kemperman [91] and, most prominently,
Strassen [152] studied approximations to $\log M^*(P^n, \varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$. These beautiful works were
largely forgotten until recently, when interest in so-called Gaussian approximations was revived by Hayashi [73,
75] and Polyanskiy-Poor-Verdú [122, 123].^1 Strassen showed that the limiting statement in (1.1) may be
refined to yield the asymptotic expansion


$$\log M^*(P^n, \varepsilon) = nH(X) - \sqrt{nV(X)}\,\Phi^{-1}(\varepsilon) - \frac{1}{2}\log n + O(1), \tag{1.3}$$


where $V(X)$ is known as the source dispersion or the varentropy, terms introduced by Kostina-Verdú [97]
and Kontoyiannis-Verdú [95]. In (1.3), $\Phi^{-1}$ is the inverse of the Gaussian cumulative distribution function.


Observe that the first-order term in the asymptotic expansion above, namely $H(X)$, coincides with the
(first-order) fundamental limit shown by Shannon. From this expansion, one sees that if the error probability is
fixed to $\varepsilon < \frac{1}{2}$, the extra rate above the entropy we have to pay for operating at finite blocklength $n$ with
admissible error probability $\varepsilon$ is approximately $\sqrt{V(X)/n}\,\Phi^{-1}(1-\varepsilon)$. Thus, the quantity $V(X)$, which is
a function of $P$ just like the entropy $H(X)$, quantifies how fast the rates of optimal source codes converge
to $H(X)$. Similarly, for well-behaved DMCs, under mild conditions, Strassen showed that the limiting
statement in (1.2) may be refined to


$$\log M^*_{\mathrm{ave}}(W^n, \varepsilon) = nC(W) + \sqrt{nV_\varepsilon(W)}\,\Phi^{-1}(\varepsilon) + O(\log n), \tag{1.4}$$


where $V_\varepsilon(W)$ is a channel parameter known as the $\varepsilon$-channel dispersion, a term introduced by
Polyanskiy-Poor-Verdú [123]. Thus the backoff from capacity at finite blocklengths $n$ and average error probability $\varepsilon$ is
approximately $\sqrt{V_\varepsilon(W)/n}\,\Phi^{-1}(1-\varepsilon)$.
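To get a feel for the size of the second-order term, the following Python sketch evaluates the first two terms of (1.3), normalized by $n$, for a binary memoryless source. It is only an illustration: the source distribution, the error probability and the function names are hypothetical choices, and the varentropy is computed under the assumption of its standard definition as the variance of $-\log P(X)$, which this chapter does not spell out.

```python
import math
from statistics import NormalDist


def entropy_and_varentropy(pmf):
    """Return (H, V) for a pmf, with H in bits and V in bits^2.

    Assumes the standard definition V = Var(-log2 P(X)).
    """
    h = -sum(p * math.log2(p) for p in pmf if p > 0)
    second_moment = sum(p * math.log2(p) ** 2 for p in pmf if p > 0)
    return h, second_moment - h * h


def gaussian_approx_rate(pmf, n, eps):
    """First two terms of (1.3) divided by n: H - sqrt(V/n) * Phi^{-1}(eps)."""
    h, v = entropy_and_varentropy(pmf)
    return h - math.sqrt(v / n) * NormalDist().inv_cdf(eps)


pmf = [0.11, 0.89]  # hypothetical Bernoulli(0.11) source
eps = 0.01          # hypothetical admissible error probability
for n in (100, 1000, 10000):
    print(n, round(gaussian_approx_rate(pmf, n, eps), 4))
```

For $\varepsilon < 1/2$ the printed rates sit above the entropy and approach it at a rate proportional to $1/\sqrt{n}$, which is the backoff described above.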


1.1 Motivation for this Monograph 


It turns out that Gaussian approximations (the first two terms of (1.3) and (1.4)) are good proxies to the true
non-asymptotic fundamental limits ($\log M^*(P^n, \varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$) at moderate blocklengths and moderate
error probabilities for some channels and sources, as shown by Polyanskiy-Poor-Verdú [123] and
Kostina-Verdú [97]. For error probabilities that are not too small (e.g., $\varepsilon \in [10^{-6}, 10^{-3}]$), the Gaussian approximation
is often better than that provided by traditional error exponent or reliability function analysis [39, 56], where
the code rate is fixed (below the first-order fundamental limit) and the exponential decay of the error
probability is analyzed. Recent refinements to error exponent analysis using exact asymptotics [135]
or saddlepoint approximations [137] are alternative proxies to the non-asymptotic fundamental limits.
The accuracy of the Gaussian approximation in practical regimes of errors and finite blocklengths gives us
motivation to study refinements to the first-order fundamental limits of other single- and multi-user problems
in Shannon theory.

The study of asymptotic estimates with non-vanishing error probabilities (or more succinctly, fixed error
asymptotics) also uncovers several interesting phenomena that are not observable from studies of first-order
fundamental limits in single- and multi-user information theory [33, 49]. This analysis may give engineers
deeper insight into the design of practical communication systems. A non-exhaustive list includes:


1. Shannon showed that separating the tasks of source and channel coding is optimal rate-wise [141]. As
we see in Section 4.5.2 (and similarly to the case of error exponents [55]), this is not the case when the
probability of excess distortion of the source is allowed to be non-vanishing.


2. Shannon showed that feedback does not increase the capacity of a DMC [142]. It is known, however,
that variable-length feedback [125] and full output feedback [5] improve on the fixed error asymptotics
of DMCs.


^1 Some of the results in [122, 123] were already announced by S. Verdú in his Shannon lecture at the 2007 International
Symposium on Information Theory (ISIT) in Nice, France.
3. It is known that the entropy can be achieved universally for fixed-to-variable length almost lossless
source coding of a DMS [192], i.e., the source statistics do not have to be known. The redundancy has
also been studied for prefix-free codes [27]. In the fixed error setting (a setting complementary to [27]),
it was shown by Kosut and Sankar that universality imposes a penalty in the third-order
term of the asymptotic expansion in (1.3).


4. Han showed that the output from any source encoder at the optimal coding rate with asymptotically
vanishing error appears almost completely random [68]. This is the so-called folklore theorem. Hayashi
showed that the analogue of the folklore theorem does not hold when we consider the second-order
terms in asymptotic expansions (i.e., the second-order asymptotics).


5. Slepian and Wolf showed that separate encoding of two correlated sources incurs no loss rate-wise
compared to the situation where side information is also available at all encoders [151]. As we shall see
in Chapter 6, the fixed error asymptotics in the vicinity of a corner point of the polygonal Slepian-Wolf
region suggests that side-information at the encoders may be beneficial.

None of the aforementioned books focuses exclusively on the situation where the
error probabilities of various Shannon-theoretic problems are upper bounded by $\varepsilon \in (0,1)$ and asymptotic
expansions or second-order terms are sought. This is what this monograph attempts to do.


1.2 Preview of this Monograph 


This monograph is organized as follows: In the remaining parts of this chapter, we recap some quantities in
information theory and results in the method of types, a particularly useful tool for the study of
discrete memoryless systems. We also mention some probability bounds that will be used throughout the
monograph. Most of these bounds are based on refinements of the central limit theorem, and are collectively
known as Berry-Esseen theorems [17, 52]. In Chapter 2, our study of asymptotic expansions of the form
(1.3) and (1.4) begins in earnest by revisiting Strassen's work [152] on binary hypothesis testing where
the probability of false alarm is constrained to not exceed a positive constant. We find it useful to revisit
the fundamentals of hypothesis testing as many information-theoretic problems such as source and channel
coding are intimately related to hypothesis testing.

Part II of this monograph begins our study of information-theoretic problems starting with lossless and
lossy compression in Chapter 3. We emphasize, in the first part of this chapter, that (fixed-to-fixed length)
lossless source coding and binary hypothesis testing are, in fact, the same problem, and so the asymptotic
expansions developed in Chapter 2 may be directly employed for the purpose of lossless source coding. Lossy
source coding, however, is more involved. We review the recent works in [86] and [97], where the authors
independently derived asymptotic expansions for the logarithm of the minimum size of a source code that
reproduces symbols up to a certain distortion, with some admissible probability of excess distortion. Channel
coding is discussed in Chapter 4. In particular, we study the approximation in (1.4) for both discrete
memoryless and Gaussian channels. We make it a point here to be precise about the third-order $O(\log n)$
term. We state conditions on the channel under which the coefficient of the $O(\log n)$ term can be determined
exactly. This leads to some new insights concerning optimum codes for the channel coding problem. Finally,
we marry source and channel coding in the study of source-channel transmission where the probability of
excess distortion in reproducing the source is non-vanishing.


Part III of this monograph contains a sparse sampling of fixed error asymptotic results in network
information theory. The problems we discuss here have conclusive second-order asymptotic characterizations
(analogous to the second terms in the asymptotic expansions in (1.3) and (1.4)). They include some channels
with random state (Chapter 5), such as Costa's writing on dirty paper [50], mixed DMCs [57, Sec. 3.3], and
quasi-static single-input-multiple-output (SIMO) fading channels. Under the fixed error setup, we also
consider the second-order asymptotics of the Slepian-Wolf [151] distributed lossless source coding problem
(Chapter 6), the Gaussian interference channel (IC) in the strictly very strong interference regime [52]
(Chapter 7), and the Gaussian multiple access channel (MAC) with degraded message sets (Chapter 8).
The MAC with degraded message sets is also known as the cognitive [44] or asymmetric [72, 167, 128] MAC
(A-MAC). Chapter 9 concludes with a brief summary of other results, together with open problems in this
area of research. A dependence graph of the chapters in the monograph is shown in Fig. 1.1.

Figure 1.1: Dependence graph of the chapters in this monograph. An arrow from node s to t means that
results and techniques in Chapter s are required to understand the material in Chapter t.

This area of information theory, fixed error asymptotics, is vast and, at the same time, rapidly
expanding. The results described herein are not meant to be exhaustive and were somewhat dependent on the
author’s understanding of the subject and his preferences at the time of writing. However, the author has 
made it a point to ensure that results herein are conclusive in nature. This means that the problem is solved 
in the information-theoretic sense in that an operational quantity is equated to an information quantity. In 
terms of asymptotic expansions such as (1.3) and (1.4), by solved, we mean that either the second-order term
is known or, better still, both the second- and third-order terms are known. Having articulated this, the 
author confesses that there are many relevant information-theoretic problems that can be considered solved 
in the fixed error setting, but have not found their way into this monograph either due to space constraints 
or because it was difficult to meld them seamlessly with the rest of the story. 


1.3 Fundamentals of Information Theory 

In this section, we review some basic information-theoretic quantities. As with every article published in the 
Foundations and Trends in Communications and Information Theory, the reader is expected to have some 
background in information theory. Nevertheless, the only prerequisite required to appreciate this monograph 
is information theory at the level of Cover and Thomas [33]. We will also make extensive use of the method
of types, for which excellent expositions are available. To keep the exposition accessible to as wide an
audience as possible, the measure-theoretic foundations of probability will not be needed.
1.3.1 Notation


The notation we use is reasonably standard and generally follows the books by Csiszár-Körner [39] and
Han. Random variables (e.g., $X$) and their realizations (e.g., $x$) are in upper and lower case respectively.
Random variables that take on finitely many values have alphabets (support) that are denoted by calligraphic
font (e.g., $\mathcal{X}$). The cardinality of the finite set $\mathcal{X}$ is denoted as $|\mathcal{X}|$. Let the random vector $X^n$ be the vector
of random variables $(X_1, \ldots, X_n)$. We use bold face $\mathbf{x} = (x_1, \ldots, x_n)$ to denote a realization of $X^n$. The set
of all distributions (probability mass functions) supported on alphabet $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. The set of all
conditional distributions (i.e., channels) with the input alphabet $\mathcal{X}$ and the output alphabet $\mathcal{Y}$ is denoted by
$\mathcal{P}(\mathcal{Y}|\mathcal{X})$. The joint distribution induced by a marginal distribution $P \in \mathcal{P}(\mathcal{X})$ and a channel $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$
is denoted as $P \times V$, i.e.,

$$(P \times V)(x, y) := P(x)V(y|x). \tag{1.5}$$

The marginal output distribution induced by $P$ and $V$ is denoted as $PV$, i.e.,

$$PV(y) := \sum_{x \in \mathcal{X}} P(x)V(y|x). \tag{1.6}$$

If $X$ has distribution $P$, we sometimes write this as $X \sim P$.

Vectors are indicated in lower case bold face (e.g., $\mathbf{a}$) and matrices in upper case bold face (e.g., $\mathbf{A}$).
If we write $\mathbf{a} \ge \mathbf{b}$ for two vectors $\mathbf{a}$ and $\mathbf{b}$ of the same length, we mean that $a_j \ge b_j$ for every coordinate
$j$. The transpose of $\mathbf{A}$ is denoted as $\mathbf{A}'$. The vector of all zeros and the identity matrix are denoted as $\mathbf{0}$
and $\mathbf{I}$ respectively. We sometimes make the lengths and sizes explicit. The $\ell_q$-norm (for $q \ge 1$) of a vector
$\mathbf{v} = (v_1, \ldots, v_k)$ is denoted as $\|\mathbf{v}\|_q := \big(\sum_{i=1}^k |v_i|^q\big)^{1/q}$.

We use standard asymptotic notation [29]: $a_n \in O(b_n)$ if and only if (iff) $\limsup_{n\to\infty} |a_n/b_n| < \infty$;
$a_n \in \Omega(b_n)$ iff $b_n \in O(a_n)$; $a_n \in \Theta(b_n)$ iff $a_n \in O(b_n) \cap \Omega(b_n)$; $a_n \in o(b_n)$ iff $\limsup_{n\to\infty} |a_n/b_n| = 0$; and
$a_n \in \omega(b_n)$ iff $\liminf_{n\to\infty} |a_n/b_n| = \infty$. Finally, $a_n \sim b_n$ iff $\lim_{n\to\infty} a_n/b_n = 1$.

1.3.2 Information-Theoretic Quantities 

Information-theoretic quantities are denoted in the usual way [39, 49]. All logarithms and exponential
functions are to the base 2. The entropy of a discrete random variable $X$ with probability distribution
$P \in \mathcal{P}(\mathcal{X})$ is denoted as

$$H(X) = H(P) := -\sum_{x \in \mathcal{X}} P(x) \log P(x). \tag{1.7}$$

For the sake of clarity, we will sometimes make the dependence on the distribution $P$ explicit. Similarly,
given a pair of random variables $(X, Y)$ with joint distribution $P \times V \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, the conditional entropy
of $Y$ given $X$ is written as

$$H(Y|X) = H(V|P) := -\sum_{x \in \mathcal{X}} P(x) \sum_{y \in \mathcal{Y}} V(y|x) \log V(y|x). \tag{1.8}$$

The joint entropy is denoted as

$$H(X, Y) := H(X) + H(Y|X), \quad \text{or} \tag{1.9}$$

$$H(P \times V) := H(P) + H(V|P). \tag{1.10}$$

The mutual information is a measure of the correlation or dependence between random variables X and Y. 
It is interchangeably denoted as 


$$I(X; Y) := H(Y) - H(Y|X), \quad \text{or} \tag{1.11}$$

$$I(P, V) := H(PV) - H(V|P). \tag{1.12}$$
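The definitions (1.6), (1.7), (1.8) and (1.12) translate directly into a few lines of Python. The sketch below is only an illustration; the function names and the binary symmetric channel used as input are hypothetical choices, and a channel is represented as a nested list with `v[x][y]` standing for $V(y|x)$.

```python
import math


def entropy(p):
    """H(P) in bits, as in (1.7); terms with P(x) = 0 contribute zero."""
    return -sum(px * math.log2(px) for px in p if px > 0)


def output_distribution(p, v):
    """PV(y) = sum_x P(x) V(y|x), as in (1.6)."""
    return [sum(p[x] * v[x][y] for x in range(len(p))) for y in range(len(v[0]))]


def conditional_entropy(p, v):
    """H(V|P) = sum_x P(x) H(V(.|x)), an equivalent form of (1.8)."""
    return sum(p[x] * entropy(v[x]) for x in range(len(p)))


def mutual_information(p, v):
    """I(P, V) = H(PV) - H(V|P), as in (1.12)."""
    return entropy(output_distribution(p, v)) - conditional_entropy(p, v)


# Hypothetical example: binary symmetric channel with crossover probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
uniform_input = [0.5, 0.5]
print(mutual_information(uniform_input, bsc))
```

With a uniform input the result equals one minus the binary entropy of the crossover probability, as expected for a binary symmetric channel.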
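Given three random variables $(X, Y, Z)$ with joint distribution $P \times V \times W$ where $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ and
$W \in \mathcal{P}(\mathcal{Z}|\mathcal{X} \times \mathcal{Y})$, the conditional mutual information is

$$I(Y; Z|X) := H(Z|X) - H(Z|XY), \quad \text{or} \tag{1.13}$$

$$I(V, W|P) := \sum_{x \in \mathcal{X}} P(x)\, I\big(V(\cdot|x), W(\cdot|x, \cdot)\big). \tag{1.14}$$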


A particularly important quantity is the relative entropy (or Kullback-Leibler divergence [102]) between
$P$ and $Q$, which are distributions on the same finite support set $\mathcal{X}$. It is defined as the expectation with
respect to $P$ of the log-likelihood ratio $\log \frac{P(x)}{Q(x)}$, i.e.,

$$D(P\|Q) := \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}. \tag{1.15}$$

Note that if there exists an $x \in \mathcal{X}$ for which $Q(x) = 0$ while $P(x) > 0$, then the relative entropy $D(P\|Q) = \infty$.
If for every $x \in \mathcal{X}$, $Q(x) = 0$ implies $P(x) = 0$, we say that $P$ is absolutely continuous with respect to $Q$
and denote this relation by $P \ll Q$. In this case, the relative entropy is finite. It is well known that
$D(P\|Q) \ge 0$ and equality holds if and only if $P = Q$. Additionally, the conditional relative entropy between
$V, W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ given $P \in \mathcal{P}(\mathcal{X})$ is defined as

$$D(V\|W|P) := \sum_{x \in \mathcal{X}} P(x)\, D\big(V(\cdot|x)\,\|\,W(\cdot|x)\big). \tag{1.16}$$

The mutual information is a special case of the relative entropy. In particular, we have

$$I(P, V) = D(P \times V \,\|\, P \times PV) = D(V\|PV|P). \tag{1.17}$$

Furthermore, if $U_{\mathcal{X}}$ is the uniform distribution on $\mathcal{X}$, i.e., $U_{\mathcal{X}}(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$, we have

$$D(P\|U_{\mathcal{X}}) = -H(P) + \log|\mathcal{X}|. \tag{1.18}$$

The definition of relative entropy $D(P\|Q)$ can be extended to the case where $Q$ is not necessarily a
probability measure. In this case non-negativity does not hold in general. An important property we exploit
is the following: If $\mu$ denotes the counting measure (i.e., $\mu(\mathcal{A}) = |\mathcal{A}|$ for $\mathcal{A} \subset \mathcal{X}$), then similarly to (1.18)

$$D(P\|\mu) = -H(P). \tag{1.19}$$
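The identities relating relative entropy to entropy and mutual information are easy to check numerically. The following Python sketch does so for a small example; the distributions, the channel and the helper names are hypothetical, and the counting measure is represented simply as a vector of ones.

```python
import math


def entropy(p):
    """H(P) in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)


def relative_entropy(p, q):
    """D(P||Q) in bits; Q may be any non-negative measure on the same set."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf  # P is not absolutely continuous w.r.t. Q
            total += px * math.log2(px / qx)
    return total


p = [0.5, 0.25, 0.25]                                     # hypothetical distribution
v = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]                  # hypothetical channel V(y|x)
pv = [sum(p[x] * v[x][y] for x in range(3)) for y in range(2)]

# Uniform reference distribution versus the counting measure.
d_uniform = relative_entropy(p, [1 / 3] * 3)
d_counting = relative_entropy(p, [1.0] * 3)

# Mutual information computed in two ways.
joint = [p[x] * v[x][y] for x in range(3) for y in range(2)]
product = [p[x] * pv[y] for x in range(3) for y in range(2)]
mi_from_divergence = relative_entropy(joint, product)
mi_from_entropies = entropy(pv) - sum(p[x] * entropy(v[x]) for x in range(3))

print(d_uniform, -entropy(p) + math.log2(3))
print(d_counting, -entropy(p))
print(mi_from_divergence, mi_from_entropies)
```

Each printed pair agrees up to floating-point error, and the second line shows a negative value, illustrating that non-negativity can fail when the reference measure is not a probability measure.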


1.4 The Method of Types 


For finite alphabets, a particularly convenient tool in information theory is the method of types.
For a sequence $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^n$ in which $|\mathcal{X}|$ is finite, its type or empirical distribution is the probability
mass function

$$P_{\mathbf{x}}(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i = x\}, \quad \forall\, x \in \mathcal{X}. \tag{1.20}$$

Throughout, we use the notation $\mathbb{1}\{\text{clause}\}$ to mean the indicator function, i.e., this function equals 1 if
"clause" is true and 0 otherwise. The set of types formed from $n$-length sequences in $\mathcal{X}$ is denoted as
$\mathcal{P}_n(\mathcal{X})$. This is clearly a subset of $\mathcal{P}(\mathcal{X})$. The type class of $P$, denoted as $\mathcal{T}_P$, is the set of all sequences of
length $n$ whose type is $P$, i.e.,

$$\mathcal{T}_P := \{\mathbf{x} \in \mathcal{X}^n : P_{\mathbf{x}} = P\}. \tag{1.21}$$
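It is customary to indicate the dependence of $\mathcal{T}_P$ on the blocklength $n$ but we suppress this dependence for
the sake of conciseness throughout. For a sequence $\mathbf{x} \in \mathcal{T}_P$, the set of all sequences $\mathbf{y} \in \mathcal{Y}^n$ such that $(\mathbf{x}, \mathbf{y})$
has joint type $P \times V$ is the $V$-shell, denoted as $\mathcal{T}_V(\mathbf{x})$. In other words,

$$\mathcal{T}_V(\mathbf{x}) := \{\mathbf{y} \in \mathcal{Y}^n : P_{\mathbf{x},\mathbf{y}} = P \times V\}. \tag{1.22}$$

The conditional distribution $V$ is also known as the conditional type of $\mathbf{y}$ given $\mathbf{x}$. Let $\mathcal{V}_n(\mathcal{Y}; P)$ be the set
of all $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ for which the $V$-shell of a sequence of type $P$ is non-empty.

We will often find it useful to consider information-theoretic quantities of empirical distributions.
All such quantities are denoted using hats. So for example, the empirical entropy of a sequence $\mathbf{x} \in \mathcal{X}^n$ is
denoted as

$$\hat{H}(\mathbf{x}) := H(P_{\mathbf{x}}). \tag{1.23}$$

The empirical conditional entropy of $\mathbf{y} \in \mathcal{Y}^n$ given $\mathbf{x} \in \mathcal{X}^n$ where $\mathbf{y} \in \mathcal{T}_V(\mathbf{x})$ is denoted as

$$\hat{H}(\mathbf{y}|\mathbf{x}) := H(V|P_{\mathbf{x}}). \tag{1.24}$$

The empirical mutual information of a pair of sequences $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$ with joint type $P_{\mathbf{x},\mathbf{y}} = P_{\mathbf{x}} \times V$ is
denoted as

$$\hat{I}(\mathbf{x} \wedge \mathbf{y}) := I(P_{\mathbf{x}}, V). \tag{1.25}$$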

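The following lemmas form the basis of the method of types. The proofs can be found in [39].

Lemma 1.1 (Type Counting). The sets $\mathcal{P}_n(\mathcal{X})$ and $\mathcal{V}_n(\mathcal{Y}; P)$ for $P \in \mathcal{P}_n(\mathcal{X})$ satisfy

$$|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|}, \quad \text{and} \quad |\mathcal{V}_n(\mathcal{Y}; P)| \le (n+1)^{|\mathcal{X}||\mathcal{Y}|}. \tag{1.26}$$

In fact, it is easy to check that $|\mathcal{P}_n(\mathcal{X})| = \binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$, but (1.26) or its slightly stronger version

$$|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|-1} \tag{1.27}$$

usually suffices for our purposes in this monograph. This key property says that the number of types is
polynomial in the blocklength $n$.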
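The polynomial growth of the number of types can be checked by brute force for small parameters. The Python sketch below enumerates all types as vectors of symbol counts and compares the total with the stars-and-bars binomial coefficient and with the two polynomial bounds; the function name and the values of $n$ and $|\mathcal{X}|$ are hypothetical choices for illustration.

```python
from math import comb


def enumerate_types(n, k):
    """Yield every vector of k non-negative integer counts summing to n.

    Dividing such a vector by n gives a type of blocklength n on a k-letter alphabet.
    """
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in enumerate_types(n - first, k - 1):
            yield (first,) + rest


n, k = 10, 3  # hypothetical blocklength and alphabet size
num_types = sum(1 for _ in enumerate_types(n, k))
print(num_types, comb(n + k - 1, k - 1), (n + 1) ** (k - 1), (n + 1) ** k)
# prints: 66 66 121 1331
```

The number of sequences, $|\mathcal{X}|^n = 3^{10}$, is far larger than the 66 types, which is why arguments that pay a factor of $|\mathcal{P}_n(\mathcal{X})|$ lose nothing on the exponential scale.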

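Lemma 1.2 (Size of Type Class). For a type $P \in \mathcal{P}_n(\mathcal{X})$, the type class $\mathcal{T}_P \subset \mathcal{X}^n$ satisfies

$$|\mathcal{P}_n(\mathcal{X})|^{-1} \exp\big(nH(P)\big) \le |\mathcal{T}_P| \le \exp\big(nH(P)\big). \tag{1.28}$$

For a conditional type $V \in \mathcal{V}_n(\mathcal{Y}; P)$ and a sequence $\mathbf{x} \in \mathcal{T}_P$, the $V$-shell $\mathcal{T}_V(\mathbf{x}) \subset \mathcal{Y}^n$ satisfies

$$|\mathcal{V}_n(\mathcal{Y}; P)|^{-1} \exp\big(nH(V|P)\big) \le |\mathcal{T}_V(\mathbf{x})| \le \exp\big(nH(V|P)\big). \tag{1.29}$$

This lemma says that, on the exponential scale,

$$|\mathcal{T}_P| \doteq \exp\big(nH(P)\big), \quad \text{and} \quad |\mathcal{T}_V(\mathbf{x})| \doteq \exp\big(nH(V|P)\big), \tag{1.30}$$

where we used the notation $a_n \doteq b_n$ to mean equality up to a polynomial, i.e., there exist polynomials
$p_n$ and $q_n$ such that $a_n/p_n \le b_n \le q_n a_n$. We now consider probabilities of sequences. Throughout, for a
distribution $Q \in \mathcal{P}(\mathcal{X})$, we let $Q^n(\mathbf{x})$ be the product distribution, i.e.,

$$Q^n(\mathbf{x}) = \prod_{i=1}^n Q(x_i), \quad \forall\, \mathbf{x} \in \mathcal{X}^n. \tag{1.31}$$

Lemma 1.3 (Probability of Sequences). If $\mathbf{x} \in \mathcal{T}_P$ and $\mathbf{y} \in \mathcal{T}_V(\mathbf{x})$,

$$Q^n(\mathbf{x}) = \exp\big(-nD(P\|Q) - nH(P)\big) \quad \text{and} \tag{1.32}$$

$$W^n(\mathbf{y}|\mathbf{x}) = \exp\big(-nD(V\|W|P) - nH(V|P)\big). \tag{1.33}$$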
This, together with Lemma 1.2, leads immediately to the final lemma in this section.

Lemma 1.4 (Probability of Type Classes). For a type $P \in \mathcal{P}_n(\mathcal{X})$,

$$|\mathcal{P}_n(\mathcal{X})|^{-1} \exp\big(-nD(P\|Q)\big) \le Q^n(\mathcal{T}_P) \le \exp\big(-nD(P\|Q)\big). \tag{1.34}$$

For a conditional type $V \in \mathcal{V}_n(\mathcal{Y}; P)$ and a sequence $\mathbf{x} \in \mathcal{T}_P$, we have

$$|\mathcal{V}_n(\mathcal{Y}; P)|^{-1} \exp\big(-nD(V\|W|P)\big) \le W^n(\mathcal{T}_V(\mathbf{x})|\mathbf{x}) \le \exp\big(-nD(V\|W|P)\big). \tag{1.35}$$

The interpretation of this lemma is that the probability that a random i.i.d. (independently and identically
distributed) sequence $X^n$ generated from $Q^n$ belongs to the type class $\mathcal{T}_P$ is exponentially small with exponent
$D(P\|Q)$, i.e.,

$$Q^n(\mathcal{T}_P) \doteq \exp\big(-nD(P\|Q)\big). \tag{1.36}$$

The bounds in (1.35) can be interpreted similarly.
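The unconditional parts of Lemmas 1.2 to 1.4 can be verified exactly for a short binary sequence. The Python sketch below is illustrative only: the sequence, the distribution $Q$ and the variable names are hypothetical, the type class size is computed as a multinomial coefficient, and exponentials are taken to base 2 in line with the convention of Section 1.3.2.

```python
import math
from collections import Counter


def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)


def relative_entropy(p, q):
    # Assumes Q(x) > 0 wherever P(x) > 0.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


alphabet = "ab"
q = [0.7, 0.3]        # hypothetical generating distribution Q
x = "aababbaaab"      # hypothetical sequence of length n = 10
n = len(x)
counts = Counter(x)
p = [counts[a] / n for a in alphabet]  # the type of x

# |T_P| = n! / prod_a (n P(a))!, a multinomial coefficient.
size = math.factorial(n)
for a in alphabet:
    size //= math.factorial(counts[a])

prob_x = math.prod(q[alphabet.index(s)] for s in x)   # Q^n(x)
prob_class = size * prob_x                            # Q^n(T_P)
num_types = math.comb(n + len(alphabet) - 1, len(alphabet) - 1)
h, d = entropy(p), relative_entropy(p, q)

print("sequence probability:", prob_x, 2 ** (-n * (d + h)))
print("type class size:", 2 ** (n * h) / num_types, size, 2 ** (n * h))
print("type class probability:", 2 ** (-n * d) / num_types, prob_class, 2 ** (-n * d))
```

The first line shows two equal numbers, and in each of the other two lines the middle value lies between the lower and upper bounds.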


1.5 Probability Bounds 

In this section, we summarize some bounds on probabilities that we use extensively in the sequel. For a 
random variable X, we let E[X] and Var(X) be its expectation and variance respectively. To emphasize that 
the expectation is taken with respect to a random variable X with distribution P, we sometimes make this 
explicit by using a subscript, i.e., $\mathbb{E}_X$ or $\mathbb{E}_P$.


1.5.1 Basic Bounds 

We start with the familiar Markov and Chebyshev inequalities. 

Proposition 1.1 (Markov's inequality). Let $X$ be a real-valued non-negative random variable. Then for
any $a > 0$, we have

$$\Pr(X \ge a) \le \frac{\mathbb{E}[X]}{a}. \tag{1.37}$$

If we let $X$ above be the non-negative random variable $(X - \mathbb{E}[X])^2$, we obtain Chebyshev's inequality.

Proposition 1.2 (Chebyshev's inequality). Let $X$ be a real-valued random variable with mean $\mu$ and variance
$\sigma^2$. Then for any $b > 0$, we have

$$\Pr\big(|X - \mu| \ge b\sigma\big) \le \frac{1}{b^2}. \tag{1.38}$$

We now consider a collection of real-valued random variables that are i.i.d. In particular, let $X^n =
(X_1, \ldots, X_n)$ be a collection of independent random variables where each $X_i$ has distribution $P$ with zero
mean and finite variance $\sigma^2$.


Proposition 1.3 (Weak Law of Large Numbers). For every $\epsilon > 0$, we have

$$\lim_{n\to\infty} \Pr\bigg(\bigg|\frac{1}{n}\sum_{i=1}^n X_i\bigg| > \epsilon\bigg) = 0. \tag{1.39}$$

Consequently, the average $\frac{1}{n}\sum_{i=1}^n X_i$ converges to 0 in probability.

This follows by applying Chebyshev's inequality to the random variable $\frac{1}{n}\sum_{i=1}^n X_i$. Under mild
conditions, the convergence to zero in (1.39) occurs exponentially fast. See, for example, Cramér's theorem
in [43, Thm. 2.2.3].
Figure 1.2: Plots of $\Phi(y)$ and $\Phi^{-1}(\varepsilon)$


1.5.2 Central Limit-Type Bounds 

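In preparation for the next result, we denote the probability density function (pdf) of a univariate Gaussian
as

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg(-\frac{(x-\mu)^2}{2\sigma^2}\bigg). \tag{1.40}$$

We will also denote this as $\mathcal{N}(\mu, \sigma^2)$ if the argument $x$ is unnecessary. A standard Gaussian distribution is
one in which the mean $\mu = 0$ and the standard deviation $\sigma = 1$. In the multivariate case, the pdf is

$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\bigg(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\bigg), \tag{1.41}$$

where $\mathbf{x} \in \mathbb{R}^k$. A standard multivariate Gaussian distribution is one in which the mean is $\mathbf{0}_k$ and the
covariance is the identity matrix $\mathbf{I}_{k\times k}$.

For the univariate case, the cumulative distribution function (cdf) of the standard Gaussian is denoted
as

$$\Phi(y) := \int_{-\infty}^{y} \mathcal{N}(x; 0, 1)\, dx. \tag{1.42}$$

We also find it convenient to introduce the inverse of $\Phi$ as

$$\Phi^{-1}(\varepsilon) := \sup\{y \in \mathbb{R} : \Phi(y) \le \varepsilon\}, \tag{1.43}$$

which evaluates to the usual inverse for $\varepsilon \in (0,1)$ and extends continuously to take values $\pm\infty$ for $\varepsilon$ outside
$(0,1)$. These monotonically increasing functions are shown in Fig. 1.2.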
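Both functions are available in the Python standard library through `statistics.NormalDist`, whose `inv_cdf` method is only defined on the open interval $(0,1)$. The sketch below wraps it so that the supremum-based definition of the inverse is respected at and beyond the endpoints; the wrapper names `Phi` and `Phi_inv` are hypothetical.

```python
import math
from statistics import NormalDist

_STANDARD_GAUSSIAN = NormalDist(mu=0.0, sigma=1.0)


def Phi(y):
    """Cdf of the standard Gaussian."""
    return _STANDARD_GAUSSIAN.cdf(y)


def Phi_inv(eps):
    """sup{y : Phi(y) <= eps}, taking the values -inf and +inf outside (0, 1)."""
    if eps <= 0:
        return -math.inf  # the set {y : Phi(y) <= eps} is empty
    if eps >= 1:
        return math.inf   # every real y satisfies Phi(y) <= eps
    return _STANDARD_GAUSSIAN.inv_cdf(eps)


for eps in (0.0, 0.001, 0.1, 0.5, 0.9, 1.0):
    print(eps, Phi_inv(eps))
```

The symmetry $\Phi^{-1}(1-\varepsilon) = -\Phi^{-1}(\varepsilon)$, visible in the printed values, is what allows second-order terms to be written with either argument.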


If the scaling in front of the sum in the statement of the law of large numbers in (1.39) is $\frac{1}{\sqrt{n}}$ instead of
$\frac{1}{n}$, the resultant random variable $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ converges in distribution to a Gaussian random variable. As
in Proposition 1.3, let $X^n$ be a collection of i.i.d. random variables where each $X_i$ has zero mean and finite
variance $\sigma^2$.


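Proposition 1.4 (Central Limit Theorem). For any $a \in \mathbb{R}$, we have

$$\lim_{n\to\infty} \Pr\bigg(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i < a\bigg) = \Phi(a). \tag{1.44}$$

In other words,

$$\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i \xrightarrow{\;d\;} Z, \tag{1.45}$$

where $\xrightarrow{\;d\;}$ means convergence in distribution and $Z$ is the standard Gaussian random variable.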


Throughout the monograph, in the evaluation of the non-asymptotic bounds, we will use a more
quantitative version of the central limit theorem known as the Berry-Esseen theorem [17, 52]. See Feller
(Sec. XVI.5) for a proof.

Theorem 1.1 (Berry-Esseen Theorem (i.i.d. Version)). Assume that the third absolute moment is finite,
i.e., $T := \mathbb{E}[|X_1|^3] < \infty$. For every $n \in \mathbb{N}$, we have

$$\sup_{a \in \mathbb{R}} \bigg|\Pr\bigg(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i < a\bigg) - \Phi(a)\bigg| \le \frac{T}{\sigma^3\sqrt{n}}. \tag{1.46}$$


Remarkably, the Berry-Esseen theorem says that the convergence in the central limit theorem in (1.44) is
uniform in $a \in \mathbb{R}$. Furthermore, the convergence of the distribution function of $\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i$ to the Gaussian
cdf occurs at a rate of $O\big(\frac{1}{\sqrt{n}}\big)$. The constant of proportionality in the $O(\cdot)$-notation depends only on the
variance and the third absolute moment and not on any other statistics of the random variables.
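The uniform $O(1/\sqrt{n})$ rate can be observed exactly for a simple lattice example. The Python sketch below takes $X_i = B_i - p$ with $B_i$ Bernoulli($p$), computes the largest gap between the distribution function of the normalized sum and $\Phi$ from the binomial pmf, and compares it with the ratio of the third absolute moment to $\sigma^3\sqrt{n}$. The choice of distribution, the value of $p$ and the function names are assumptions made purely for illustration.

```python
import math
from statistics import NormalDist


def berry_esseen_gap(n, p):
    """Exact sup over a of |Pr(S_n / (sigma sqrt(n)) < a) - Phi(a)| for centred Bernoulli(p) sums.

    The distribution function is a step function and Phi is increasing, so the
    supremum is attained at an atom, approached from the left or from the right.
    """
    sigma = math.sqrt(p * (1 - p))
    std = NormalDist()
    gap, cdf_left = 0.0, 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        atom = (k - n * p) / (sigma * math.sqrt(n))
        phi = std.cdf(atom)
        cdf_right = cdf_left + pmf
        gap = max(gap, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return gap


def third_moment_ratio(n, p):
    """T / (sigma^3 sqrt(n)) with T = E|B - p|^3 for B ~ Bernoulli(p)."""
    sigma = math.sqrt(p * (1 - p))
    third = p * (1 - p) ** 3 + (1 - p) * p ** 3
    return third / (sigma ** 3 * math.sqrt(n))


p = 0.3  # hypothetical Bernoulli parameter
for n in (10, 100, 400):
    print(n, round(berry_esseen_gap(n, p), 4), round(third_moment_ratio(n, p), 4))
```

In each row the exact gap stays below the third-moment ratio, and both columns shrink as $n$ grows.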

There are many generalizations of the Berry-Esseen theorem. One which we will need is the relaxation of
the assumption that the random variables are identically distributed. Let $X^n = (X_1, \ldots, X_n)$ be a collection
of independent random variables where each random variable has zero mean, variance $\sigma_i^2 := \mathbb{E}[X_i^2] > 0$ and
third absolute moment $T_i := \mathbb{E}[|X_i|^3] < \infty$. We respectively define the average variance and average third
absolute moment as

$$\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2, \quad \text{and} \quad T := \frac{1}{n}\sum_{i=1}^n T_i. \tag{1.47}$$


Theorem 1.2 (Berry-Esseen Theorem (General Version)). For every $n \in \mathbb{N}$, we have

$$\sup_{a \in \mathbb{R}} \bigg|\Pr\bigg(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i < a\bigg) - \Phi(a)\bigg| \le \frac{6T}{\sigma^3\sqrt{n}}. \tag{1.48}$$

Observe that as with the i.i.d. version of the Berry-Esseen theorem, the remainder term scales as $O\big(\frac{1}{\sqrt{n}}\big)$.

The proof of the following theorem uses the Berry-Esseen theorem (among other techniques). This
theorem is proved in Polyanskiy-Poor-Verdú [123, Lem. 47]. Together with its variants, this theorem is
useful for obtaining third-order asymptotics for binary hypothesis testing and other coding problems with
non-vanishing error probabilities.

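Theorem 1.3. Assume the same setup as in Theorem 1.2. For any $\gamma \ge 0$, we have

$$\mathbb{E}\bigg[\exp\bigg(-\sum_{i=1}^n X_i\bigg) \mathbb{1}\bigg\{\sum_{i=1}^n X_i > \gamma\bigg\}\bigg] \le 2\bigg(\frac{\log 2}{\sqrt{2\pi}} + \frac{12\,T}{\sigma^2}\bigg) \frac{\exp(-\gamma)}{\sigma\sqrt{n}}. \tag{1.49}$$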
It is trivial to see that the expectation in (1.49) is upper bounded by $\exp(-\gamma)$. The additional factor of
$(\sigma\sqrt{n})^{-1}$ is crucial in proving coding theorems with better third-order terms. Readers familiar with strong
large deviation theorems or exact asymptotics (see, e.g., [23, Thms. 3.3 and 3.5] or [43, Thm. 3.7.4]) will
notice that (1.49) is in the same spirit as the theorem by Bahadur and Ranga-Rao. There are two
advantages of (1.49) compared to strong large deviation theorems. First, the bound is purely in terms of $\sigma^2$
and $T$, and second, one does not have to differentiate between lattice and non-lattice random variables. The
disadvantage of (1.49) is that the constant is worse, but this will not concern us as we focus on asymptotic
results in this monograph; hence constants do not affect the main results.

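For multi-terminal problems that we encounter in the latter parts of this monograph, we will require
vector (or multidimensional) versions of the Berry-Esseen theorem. The following is due to Götze [63].

Theorem 1.4 (Vector Berry-Esseen Theorem I). Let $X_1^k, \ldots, X_n^k$ be independent $\mathbb{R}^k$-valued random vectors
with zero mean. Let

$$S_n^k := \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i^k. \tag{1.50}$$

Assume that $S_n^k$ has the following statistics:

$$\mathrm{Cov}(S_n^k) = \mathbb{E}\big[S_n^k (S_n^k)'\big] = \mathbf{I}_{k\times k}, \quad \text{and} \quad \xi := \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\|X_i^k\|_2^3\big]. \tag{1.51}$$

Let $Z^k$ be a standard Gaussian random vector, i.e., its distribution is $\mathcal{N}(\mathbf{0}_k, \mathbf{I}_{k\times k})$. Then, for all $n \in \mathbb{N}$, we
have

$$\sup_{\mathscr{C} \in \mathfrak{C}_k} \big|\Pr\big(S_n^k \in \mathscr{C}\big) - \Pr\big(Z^k \in \mathscr{C}\big)\big| \le \frac{c_k\, \xi}{\sqrt{n}}, \tag{1.52}$$

where $\mathfrak{C}_k$ is the family of all convex subsets of $\mathbb{R}^k$, and where $c_k$ is a constant that depends only on the
dimension $k$.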


Theorem 1.4 can be applied for random vectors that are independent but not necessarily identically
distributed. The constant $c_k$ can be upper bounded by $400\,k^{1/4}$ if the random vectors are i.i.d., a result by
Bentkus [15]. However, its precise value will not be of concern to us in this monograph. Observe that the
scalar versions of the Berry-Esseen theorems (in Theorems 1.1 and 1.2) are special cases (apart from the
constant) of the vector version in which the family of convex subsets is restricted to the family of semi-infinite
intervals $(-\infty, a)$.

We will frequently encounter random vectors with non-identity covariance matrices. The following modification of Theorem 1.4 is due to Watanabe-Kuzuoka-Tan [177, Cor. 29].

Corollary 1.1 (Vector Berry-Esseen Theorem II). Assume the same setup as in Theorem 1.4, except that $\mathrm{Cov}(S_n^k) = V$, a positive definite matrix. Then, for all $n \in \mathbb{N}$, we have

\[ \sup_{\mathscr{C} \in \mathfrak{C}_k} \big| \Pr(S_n^k \in \mathscr{C}) - \Pr(Z^k \in \mathscr{C}) \big| \le \frac{c_k\, \xi}{\lambda_{\min}(V)^{3/2} \sqrt{n}}, \tag{1.53} \]

where $\lambda_{\min}(V) > 0$ is the smallest eigenvalue of $V$.

The final probability bound is a quantitative version of the so-called multivariate delta method [174, Thm. 5.15]. Numerous similar statements of varying generalities have appeared in the statistics literature (e.g., [24, 175]). The simple version we present was shown by MolavianJazi and Laneman [112], who extended ideas in Hoeffding and Robbins' paper (see Thm. 4 therein) to provide rates of convergence to Gaussianity under appropriate technical conditions. This result essentially says that a differentiable function of a normalized sum of independent random vectors also satisfies a Berry-Esseen-type result.
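Theorem 1.5 (Berry-Esseen Theorem for Functions of i.i.d. Random Vectors). Assume that $X_1^k, \ldots, X_n^k$ are $\mathbb{R}^k$-valued, zero-mean, i.i.d. random vectors with positive definite covariance $\mathrm{Cov}(X_1^k)$ and finite third absolute moment $\xi := \mathbb{E}[\|X_1^k\|_2^3]$. Let $f(x)$ be a vector-valued function from $\mathbb{R}^k$ to $\mathbb{R}^l$ that is also twice continuously differentiable in a neighborhood of $x = 0$. Let $J \in \mathbb{R}^{l \times k}$ be the Jacobian matrix of $f(x)$ evaluated at $x = 0$, i.e., its elements are

\[ J_{ij} = \frac{\partial f_i(x)}{\partial x_j} \bigg|_{x = 0}, \tag{1.54} \]

where $i = 1, \ldots, l$ and $j = 1, \ldots, k$. Then, for every $n \in \mathbb{N}$, we have

\[ \sup_{\mathscr{C} \in \mathfrak{C}_l} \bigg| \Pr\bigg( f\Big( \frac{1}{n} \sum_{i=1}^n X_i^k \Big) \in \mathscr{C} \bigg) - \Pr\big( Z^l \in \mathscr{C} \big) \bigg| \le \frac{c}{\sqrt{n}}, \tag{1.55} \]

where $c > 0$ is a finite constant, and $Z^l$ is a Gaussian random vector in $\mathbb{R}^l$ with mean vector and covariance matrix respectively given as

\[ \mathbb{E}[Z^l] = f(0), \quad\text{and}\quad \mathrm{Cov}(Z^l) = \frac{J\, \mathrm{Cov}(X_1^k)\, J'}{n}. \tag{1.56} \]

In particular, the inequality in (1.55) implies that

\[ \sqrt{n} \bigg( f\Big( \frac{1}{n} \sum_{i=1}^n X_i^k \Big) - f(0) \bigg) \xrightarrow{\;d\;} \mathcal{N}\big( 0,\, J\, \mathrm{Cov}(X_1^k)\, J' \big), \tag{1.57} \]

which is a canonical statement in the study of the multivariate delta method [174, Thm. 5.15].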


Finally, we remark that Ingber-Wang-Kochman used a result similar to that of Theorem 1.5 to derive second-order asymptotic results for various Shannon-theoretic problems. However, they analyzed the behavior of functions of distributions instead of functions of random vectors as in Theorem 1.5.
Chapter 2

Binary Hypothesis Testing 


In this chapter, we review asymptotic expansions in simple (non-composite) binary hypothesis testing when one of the two error probabilities is non-vanishing. We find this useful, as many coding theorems we encounter in subsequent chapters can be stated in terms of quantities related to binary hypothesis testing. For example, as pointed out in Csiszár and Körner [39, Ch. 1], fixed-to-fixed length lossless source coding and binary hypothesis testing are intimately connected through the relation between relative entropy and entropy in (1.18). Another example is in point-to-point channel coding, where a powerful non-asymptotic converse theorem [152, Eq. (4.29)], [123, Sec. III-E], [164, Prop. 6] can be stated in terms of the so-called $\varepsilon$-hypothesis testing divergence and the $\varepsilon$-information spectrum divergence (cf. Proposition 4.4). The properties of these two quantities, as well as the relation between them, are discussed. Using various probabilistic limit theorems, we also evaluate these quantities in the asymptotic setting for product distributions. A corollary of the results presented is the familiar Chernoff-Stein lemma [39, Thm. 1.2], which asserts that the exponent with growing number of observations of the type-II error for a non-vanishing type-I error in a binary hypothesis test of $P$ against $Q$ is the relative entropy $D(P\|Q)$.

The material in this chapter is based largely on the seminal work by Strassen [152, Thm. 3.1]. The exposition is based on the more recent works by Polyanskiy-Poor-Verdú [123, App. C], Tomamichel-Tan [164, Sec. III] and Tomamichel-Hayashi [163, Lem. 12].


2.1 Non-Asymptotic Quantities and Their Properties 

Consider the simple (non-composite) binary hypothesis test:

\[ H_0 : Z \sim P, \qquad H_1 : Z \sim Q, \tag{2.1} \]

where $P$ and $Q$ are two probability distributions on the same space $\mathcal{Z}$. We assume that the space $\mathcal{Z}$ is finite to keep the subsequent exposition simple. The notation in (2.1) means that under the null hypothesis $H_0$, the random variable $Z$ is distributed as $P \in \mathcal{P}(\mathcal{Z})$, while under the alternative hypothesis $H_1$, it is distributed according to a different distribution $Q \in \mathcal{P}(\mathcal{Z})$. We would like to study the optimal performance of a hypothesis test in terms of the distributions $P$ and $Q$.

There are several ways to measure the performance of a hypothesis test which, in precise terms, is a mapping $\delta$ from the observation space $\mathcal{Z}$ to $[0,1]$. If the observation $z$ is such that $\delta(z) \approx 0$, this means the test favors the null hypothesis $H_0$. Conversely, $\delta(z) \approx 1$ means that the test favors the alternative hypothesis $H_1$ (or alternatively, rejects the null hypothesis $H_0$). If $\delta(z) \in \{0,1\}$, the test is called deterministic; otherwise it is called randomized. Traditionally, there are three quantities that are of interest for a given test $\delta$. The first is the probability of false alarm

\[ P_{\mathrm{FA}} := \sum_{z \in \mathcal{Z}} \delta(z) P(z) = \mathbb{E}_P[\delta(Z)]. \tag{2.2} \]

The second is the probability of missed detection

\[ P_{\mathrm{MD}} := \sum_{z \in \mathcal{Z}} (1 - \delta(z)) Q(z) = \mathbb{E}_Q[1 - \delta(Z)]. \tag{2.3} \]

The third is the probability of detection, which is one minus the probability of missed detection, i.e.,

\[ P_{\mathrm{D}} := \sum_{z \in \mathcal{Z}} \delta(z) Q(z) = \mathbb{E}_Q[\delta(Z)]. \tag{2.4} \]

The probabilities of false alarm and missed detection are traditionally called the type-I and type-II errors respectively in the statistics literature. The probability of detection and the probability of false alarm are also called the power and the significance level respectively. The “holy grail” is, of course, to design a test such that $P_{\mathrm{FA}} = 0$ while $P_{\mathrm{D}} = 1$, but this is clearly impossible unless $P$ and $Q$ are mutually singular measures.
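As a small numerical illustration of (2.2)-(2.4), the following Python sketch evaluates the three quantities for a test on a finite alphabet. The function name `error_probabilities` and the toy distributions are hypothetical and chosen only for illustration.

```python
def error_probabilities(P, Q, delta):
    """Return (P_FA, P_MD, P_D) for a test delta on a finite alphabet.

    P, Q  : lists of probabilities under H0 and H1 (same length)
    delta : list with delta[z] in [0, 1]; values near 1 favor H1
    """
    p_fa = sum(d * p for d, p in zip(delta, P))          # E_P[delta(Z)], eq. (2.2)
    p_md = sum((1.0 - d) * q for d, q in zip(delta, Q))  # E_Q[1 - delta(Z)], eq. (2.3)
    p_d = sum(d * q for d, q in zip(delta, Q))           # E_Q[delta(Z)], eq. (2.4)
    return p_fa, p_md, p_d


# Hypothetical example on a three-letter alphabet.
P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]
delta = [0.0, 0.5, 1.0]   # randomized: a fair coin decides when z is the middle letter
p_fa, p_md, p_d = error_probabilities(P, Q, delta)
print(p_fa, p_md, p_d)    # 0.375 0.375 0.625
```

The randomized entry `delta[1] = 0.5` shows why the test is modeled as a map into $[0,1]$ rather than into $\{0,1\}$.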

Since misses are usually more costly than false alarms, let us fix a number $\varepsilon \in (0,1)$ that represents a tolerable probability of false alarm (type-I error). Then define the smallest type-II error in the binary hypothesis test (2.1) with type-I error not exceeding $\varepsilon$, i.e.,

\[ \beta_{1-\varepsilon}(P, Q) := \inf_{\delta : \mathcal{Z} \to [0,1]} \big\{ \mathbb{E}_Q[1 - \delta(Z)] : \mathbb{E}_P[\delta(Z)] \le \varepsilon \big\}. \tag{2.5} \]

Observe that $\mathbb{E}_P[\delta(Z)] \le \varepsilon$ constrains the probability of false alarm to be no greater than $\varepsilon$. Thus, we are searching over all tests $\delta$ satisfying $\mathbb{E}_P[\delta(Z)] \le \varepsilon$ such that the probability of missed detection is minimized. Intuitively, $\beta_{1-\varepsilon}(P, Q)$ quantifies, in a non-asymptotic fashion, the performance of an optimal hypothesis test between $P$ and $Q$.

A related quantity is the $\varepsilon$-hypothesis testing divergence

\[ D_h^\varepsilon(P\|Q) := -\log \frac{\beta_{1-\varepsilon}(P, Q)}{1 - \varepsilon}. \tag{2.6} \]

This is a measure of the distinguishability of $P$ from $Q$. As can be seen from (2.6), $\beta_{1-\varepsilon}(P, Q)$ and $D_h^\varepsilon(P\|Q)$ are simple functions of each other. We prefer to express the results in this monograph mostly in terms of $D_h^\varepsilon(P\|Q)$ because it shares similar properties with the usual relative entropy $D(P\|Q)$, as is evidenced by the following lemma.

Lemma 2.1 (Properties of $D_h^\varepsilon$). The $\varepsilon$-hypothesis testing divergence satisfies the positive definiteness condition [Prop. 3.2], i.e.,

\[ D_h^\varepsilon(P\|Q) \ge 0. \tag{2.7} \]

Equality holds if and only if $P = Q$. In addition, it also satisfies the data processing inequality [Lem. 1], i.e., for any channel $W$,

\[ D_h^\varepsilon(PW\|QW) \le D_h^\varepsilon(P\|Q). \tag{2.8} \]
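For finite alphabets, the infimum in (2.5) can be evaluated by a Neyman-Pearson-type argument, which is a standard fact rather than something derived in the text. The sketch below is one way to do so, assuming that $Q$ is a probability distribution (so its masses sum to one); the function names `beta` and `d_h` are hypothetical.

```python
import math


def beta(P, Q, eps):
    """Smallest type-II error E_Q[1 - delta] subject to E_P[delta] <= eps.

    Fractional-knapsack form of the Neyman-Pearson test: delta is set to 1
    on the points with the largest ratio Q(z)/P(z) until the false-alarm
    budget eps is used up; the last point may be randomized.
    """
    gain = 0.0      # E_Q[delta] collected so far
    budget = eps    # remaining false-alarm probability
    items = []
    for p, q in zip(P, Q):
        if p == 0.0:
            gain += q              # rejecting H0 here costs nothing under P
        else:
            items.append((q / p, p, q))
    items.sort(reverse=True)       # largest likelihood ratio Q/P first
    for _, p, q in items:
        if budget <= 0.0:
            break
        frac = min(1.0, budget / p)  # fraction of this point assigned to H1
        gain += frac * q
        budget -= frac * p
    return max(0.0, 1.0 - gain)


def d_h(P, Q, eps):
    """eps-hypothesis testing divergence of (2.6), in nats."""
    b = beta(P, Q, eps)
    return math.inf if b == 0.0 else -math.log(b / (1.0 - eps))


P = [0.5, 0.5]
Q = [0.9, 0.1]
print(beta(P, Q, 0.1))   # approximately 0.82
print(d_h(P, Q, 0.1))    # approximately 0.093 nats
print(d_h(P, P, 0.1))    # zero up to rounding, since the two distributions coincide
```

Merging the two letters into one (a deterministic channel $W$) gives $PW = QW$, so the divergence drops to zero, in line with (2.7) and (2.8).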

While the $\varepsilon$-hypothesis testing divergence occurs naturally and frequently in coding problems, it is usually hard to analyze directly. Thus, we now introduce an equally important quantity. Define the $\varepsilon$-information spectrum divergence $D_s^\varepsilon(P\|Q)$ as

\[ D_s^\varepsilon(P\|Q) := \sup\bigg\{ R \in \mathbb{R} : P\Big( \Big\{ z \in \mathcal{Z} : \log \frac{P(z)}{Q(z)} \le R \Big\} \Big) \le \varepsilon \bigg\}. \tag{2.9} \]

Just as in information spectrum analysis, this quantity places the distribution of the log-likelihood ratio $\log \frac{P(Z)}{Q(Z)}$ (where $Z \sim P$), and not just its expectation, in the most prominent role. See Fig. 2.1 for an interpretation of the definition in (2.9).

Figure 2.1: Illustration of the $\varepsilon$-information spectrum divergence $D_s^\varepsilon(P\|Q)$, which is the largest point $R^*$ for which the probability mass to the left is no larger than $\varepsilon$.

As we will see, the $\varepsilon$-information spectrum divergence is intimately related to the $\varepsilon$-hypothesis testing divergence (cf. Lemma 2.4). The former is, however, easier to compute. Note that if $P$ and $Q$ are product measures, then by virtue of the fact that $\log \frac{P(Z)}{Q(Z)}$ is a sum of independent random variables, one can estimate the probability in (2.9) using various probability tail bounds. This we do in the following section.
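On a finite alphabet the supremum in (2.9) can be read off from the sorted values of the log-likelihood ratio. The following Python sketch does this; the function name `d_s` and the example distributions are hypothetical, and natural logarithms are assumed.

```python
import math


def d_s(P, Q, eps):
    """eps-information spectrum divergence of (2.9), in nats.

    Sort the values of log P(z)/Q(z) and return the first one at which the
    cumulative P-mass exceeds eps; every smaller R satisfies the
    constraint in (2.9), so this value is the supremum.
    """
    points = []
    for p, q in zip(P, Q):
        if p > 0.0:  # points with P(z) = 0 carry no mass under P
            llr = math.inf if q == 0.0 else math.log(p / q)
            points.append((llr, p))
    points.sort()
    cumulative = 0.0
    for llr, p in points:
        cumulative += p
        if cumulative > eps:
            return llr
    return math.inf  # only reached if eps is at least the total mass


P = [0.2, 0.3, 0.5]
Q = [0.5, 0.3, 0.2]
for eps in (0.1, 0.3, 0.6):
    print(eps, d_s(P, Q, eps))
```

For this example the three printed values are $\log 0.4$, $0$ and $\log 2.5$: as $\varepsilon$ grows, the threshold $R^*$ of Fig. 2.1 moves to the right.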

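We now state two useful properties of $D_s^\varepsilon(P\|Q)$. The proofs of these lemmas are straightforward and can be found in [164, Sec. III.A].

Lemma 2.2 (Sifting from a convex combination). Let $P \in \mathcal{P}(\mathcal{Z})$ and let $Q = \sum_k \alpha_k Q_k$ be an at most countable convex combination of distributions $Q_k \in \mathcal{P}(\mathcal{Z})$ with non-negative weights $\alpha_k$ summing to one, i.e., $\sum_k \alpha_k = 1$. Then,

\[ D_s^\varepsilon(P\|Q) \le \inf_k \bigg\{ D_s^\varepsilon(P\|Q_k) + \log \frac{1}{\alpha_k} \bigg\}. \tag{2.10} \]

In particular, Lemma 2.2 tells us that if there exists some $\gamma > 0$ such that $\tilde{Q}(z) \le \gamma\, Q(z)$ for all $z \in \mathcal{Z}$, then

\[ D_s^\varepsilon(P\|\tilde{Q}) \ge D_s^\varepsilon(P\|Q) - \log \gamma. \tag{2.11} \]

Lemma 2.3 (“Symbol-wise” relaxation of $D_s^\varepsilon$). Let $W$ and $V$ be two channels from $\mathcal{X}$ to $\mathcal{Y}$ and let $P \in \mathcal{P}(\mathcal{X})$. Then,

\[ D_s^\varepsilon(P \times W \,\|\, P \times V) \le \sup_{x \in \mathcal{X}} D_s^\varepsilon\big( W(\cdot|x) \,\|\, V(\cdot|x) \big). \tag{2.12} \]

One can readily toggle between the $\varepsilon$-hypothesis testing divergence and the $\varepsilon$-information spectrum divergence because they satisfy the bounds in the following lemma. The proof of this lemma mimics that of [163, Lem. 12].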

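Lemma 2.4 (Relation between divergences). For every $\varepsilon \in (0,1)$ and every $\eta \in (0, 1-\varepsilon)$, we have

\begin{align}
D_s^\varepsilon(P\|Q) - \log \frac{1}{1-\varepsilon} &\le D_h^\varepsilon(P\|Q) \tag{2.13} \\
&\le D_s^{\varepsilon+\eta}(P\|Q) + \log \frac{1-\varepsilon}{\eta}. \tag{2.14}
\end{align}

Proof. The following proof is based on that for [163, Lem. 12]. For the lower bound in (2.13), consider the likelihood ratio test

\[ \delta(z) := 1\bigg\{ \log \frac{P(z)}{Q(z)} \le \gamma \bigg\}, \quad\text{where}\quad \gamma := D_s^\varepsilon(P\|Q) - \xi \tag{2.15} \]

for some small $\xi > 0$. This test clearly satisfies $\mathbb{E}_P[\delta(Z)] \le \varepsilon$ by the definition of the $\varepsilon$-information spectrum divergence. On the other hand,

\begin{align}
\mathbb{E}_Q[1 - \delta(Z)] &= \sum_{z \in \mathcal{Z}} Q(z)\, 1\bigg\{ \log \frac{P(z)}{Q(z)} > \gamma \bigg\} \tag{2.16} \\
&\le \sum_{z \in \mathcal{Z}} P(z) \exp(-\gamma)\, 1\bigg\{ \log \frac{P(z)}{Q(z)} > \gamma \bigg\} \tag{2.17} \\
&\le \sum_{z \in \mathcal{Z}} P(z) \exp(-\gamma) \tag{2.18} \\
&\le \exp(-\gamma). \tag{2.19}
\end{align}

As a result, by the definition of $D_h^\varepsilon(P\|Q)$, we have

\[ D_h^\varepsilon(P\|Q) \ge \gamma - \log \frac{1}{1-\varepsilon} = D_s^\varepsilon(P\|Q) - \xi - \log \frac{1}{1-\varepsilon}. \tag{2.20} \]

Finally, take $\xi \downarrow 0$ to complete the proof of (2.13).

For the upper bound in (2.14), we may assume $D_h^\varepsilon(P\|Q)$ is finite; otherwise there is nothing to prove, as $P$ is not absolutely continuous with respect to $Q$ and so $D_s^{\varepsilon+\eta}(P\|Q)$ is infinite. According to the definition of $D_h^\varepsilon(P\|Q)$, for any $\gamma > 0$, there exists a test $\delta$ satisfying $\mathbb{E}_P[\delta(Z)] \le \varepsilon$ such that

\begin{align}
(1-\varepsilon) \exp\big( -D_h^\varepsilon(P\|Q) \big) &\ge \mathbb{E}_Q[1 - \delta(Z)] \tag{2.21} \\
&\ge \sum_{z : P(z) \le \gamma Q(z)} Q(z) (1 - \delta(z)) \tag{2.22} \\
&\ge \frac{1}{\gamma} \sum_{z : P(z) \le \gamma Q(z)} P(z) (1 - \delta(z)) \tag{2.23} \\
&\ge \frac{1}{\gamma} \bigg[ \sum_{z} P(z) (1 - \delta(z)) - \sum_{z : P(z) > \gamma Q(z)} P(z) \bigg] \tag{2.24} \\
&\ge \frac{1}{\gamma} \bigg[ 1 - \varepsilon - P\Big( \Big\{ z : \frac{P(z)}{Q(z)} > \gamma \Big\} \Big) \bigg], \tag{2.25}
\end{align}

where (2.25) follows because $\mathbb{E}_P[\delta(Z)] \le \varepsilon$. Now fix a small $\xi > 0$ and choose

\[ \gamma = \exp\big( D_s^{\varepsilon+\eta}(P\|Q) + \xi \big). \tag{2.26} \]

Consequently, from (2.25), we have

\[ D_h^\varepsilon(P\|Q) \le D_s^{\varepsilon+\eta}(P\|Q) + \xi - \log\Bigg( 1 - \frac{ P\big( \big\{ z : \log \frac{P(z)}{Q(z)} > D_s^{\varepsilon+\eta}(P\|Q) + \xi \big\} \big) }{1 - \varepsilon} \Bigg). \tag{2.27} \]

By the definition of $D_s^{\varepsilon+\eta}(P\|Q)$, the probability within the logarithm is upper bounded by $1 - \varepsilon - \eta$, so the argument of the logarithm is at least $\eta/(1-\varepsilon)$. Taking $\xi \downarrow 0$ completes the proof of (2.14) and hence, the lemma. □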


2.2 Asymptotic Expansions 

In this section, we consider the asymptotic expansions of $D_h^\varepsilon(P^{(n)}\|Q^{(n)})$ and $D_s^\varepsilon(P^{(n)}\|Q^{(n)})$ when $P^{(n)}$ and $Q^{(n)}$ are product distributions, i.e.,

\[ P^{(n)}(\mathbf{z}) := \prod_{i=1}^n P_i(z_i), \quad\text{and}\quad Q^{(n)}(\mathbf{z}) := \prod_{i=1}^n Q_i(z_i), \tag{2.28} \]

for all $\mathbf{z} = (z_1, \ldots, z_n) \in \mathcal{Z}^n$. The component distributions $\{(P_i, Q_i)\}_{i=1}^n$ are not necessarily the same for each $i$. However, we do assume for the sake of simplicity that for each $i$, $P_i \ll Q_i$, so $D(P_i\|Q_i) < \infty$. Let $V(P\|Q)$ be the variance of the log-likelihood ratio between $P$ and $Q$, i.e.,

\[ V(P\|Q) := \sum_{z \in \mathcal{Z}} P(z) \bigg[ \log \frac{P(z)}{Q(z)} - D(P\|Q) \bigg]^2. \tag{2.29} \]

This is also known as the relative entropy variance. Let the third absolute moment of the log-likelihood ratio between $P$ and $Q$ be

\[ T(P\|Q) := \sum_{z \in \mathcal{Z}} P(z) \bigg| \log \frac{P(z)}{Q(z)} - D(P\|Q) \bigg|^3. \tag{2.30} \]

Also define the following quantities:

\begin{align}
D_n &:= \frac{1}{n} \sum_{i=1}^n D(P_i\|Q_i), \tag{2.31} \\
V_n &:= \frac{1}{n} \sum_{i=1}^n V(P_i\|Q_i), \quad\text{and} \tag{2.32} \\
T_n &:= \frac{1}{n} \sum_{i=1}^n T(P_i\|Q_i). \tag{2.33}
\end{align}

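The first result in this section is the following:

Proposition 2.1 (Berry-Esseen bounds for $D_s^\varepsilon$). Assume there exists a constant $V_- > 0$ such that $V_n \ge V_-$. We have

\[ \Phi^{-1}\bigg( \varepsilon - \frac{6 T_n}{\sqrt{n V_-^3}} \bigg) \le \frac{ D_s^\varepsilon(P^{(n)}\|Q^{(n)}) - n D_n }{ \sqrt{n V_n} } \le \Phi^{-1}\bigg( \varepsilon + \frac{6 T_n}{\sqrt{n V_-^3}} \bigg). \tag{2.34} \]

Proof. Let $Z^n$ be distributed according to $P^{(n)}$. By using the product structure of $P^{(n)}$ and $Q^{(n)}$ in (2.28),

\[ \Pr\bigg( \log \frac{P^{(n)}(Z^n)}{Q^{(n)}(Z^n)} \le R \bigg) = \Pr\bigg( \sum_{i=1}^n \log \frac{P_i(Z_i)}{Q_i(Z_i)} \le R \bigg). \tag{2.35} \]

By the Berry-Esseen theorem in Theorem 1.2, we have

\[ \bigg| \Pr\bigg( \sum_{i=1}^n \log \frac{P_i(Z_i)}{Q_i(Z_i)} \le R \bigg) - \Phi\bigg( \frac{R - n D_n}{\sqrt{n V_n}} \bigg) \bigg| \le \frac{6 T_n}{\sqrt{n V_n^3}}. \tag{2.36} \]

The result immediately follows by using the bound $V_n \ge V_-$. □

A special case of the bound above occurs when $P_i = P$ and $Q_i = Q$ for all $i = 1, \ldots, n$. In this case, we write $P^n$ for $P^{(n)}$ and similarly, $Q^n$ for $Q^{(n)}$. One has:

Corollary 2.1 (Asymptotics of $D_s^\varepsilon$). If $V(P\|Q) > 0$, then

\[ D_s^\varepsilon(P^n\|Q^n) = n D(P\|Q) + \sqrt{n V(P\|Q)}\, \Phi^{-1}(\varepsilon) + O(1). \tag{2.37} \]

Proof. Since $V(P\|Q) > 0$ and $T(P\|Q) < \infty$ (because $P \ll Q$), the term $6 T_n / \sqrt{n V_-^3}$ in (2.34) is equal to $c/\sqrt{n}$ for some finite $c > 0$. By Taylor expansions,

\[ \Phi^{-1}\bigg( \varepsilon \pm \frac{c}{\sqrt{n}} \bigg) = \Phi^{-1}(\varepsilon) + O\bigg( \frac{1}{\sqrt{n}} \bigg), \tag{2.38} \]

which completes the proof. □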
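To see Proposition 2.1 and Corollary 2.1 at work, the sketch below computes $D_s^\varepsilon(P^n\|Q^n)$ exactly for i.i.d. Bernoulli components (where the log-likelihood ratio depends only on the number of ones) and compares it with the Gaussian terms. The choice $P = \mathrm{Bern}(0.3)$, $Q = \mathrm{Bern}(0.5)$, $n = 10^5$, $\varepsilon = 0.1$ and all function names are hypothetical; natural logarithms are used.

```python
import math
from statistics import NormalDist


def bernoulli_moments(p, q):
    """D, V and T of (2.29)-(2.30) for P = Bern(p), Q = Bern(q), in nats."""
    a = math.log(p / q)              # log-likelihood ratio when z = 1
    b = math.log((1 - p) / (1 - q))  # log-likelihood ratio when z = 0
    D = p * a + (1 - p) * b
    V = p * (a - D) ** 2 + (1 - p) * (b - D) ** 2
    T = p * abs(a - D) ** 3 + (1 - p) * abs(b - D) ** 3
    return a, b, D, V, T


def exact_d_s(n, p, q, eps):
    """Exact D_s^eps(P^n || Q^n) for i.i.d. Bernoulli components."""
    a, b, _, _, _ = bernoulli_moments(p, q)
    points = []
    for k in range(n + 1):  # k = number of ones in the sequence
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        points.append((k * a + (n - k) * b, math.exp(log_pmf)))
    points.sort()
    cumulative = 0.0
    for llr, mass in points:
        cumulative += mass
        if cumulative > eps:
            return llr
    return math.inf


n, p, q, eps = 100_000, 0.3, 0.5, 0.1
_, _, D, V, T = bernoulli_moments(p, q)
slack = 6 * T / math.sqrt(n * V ** 3)         # Berry-Esseen term in (2.34)
inv = NormalDist().inv_cdf                    # Phi^{-1}
exact = exact_d_s(n, p, q, eps)
normalized = (exact - n * D) / math.sqrt(n * V)
lower, upper = inv(eps - slack), inv(eps + slack)
gaussian = n * D + math.sqrt(n * V) * inv(eps)  # first two terms of (2.37)
print(lower, normalized, upper)
print(exact, gaussian)
```

The blocklength is taken large so that $\varepsilon - 6T_n/\sqrt{nV_-^3}$ stays inside $(0,1)$; for short blocklengths the left-hand side of (2.34) is not defined with the constant 6.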


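In some applications, it is not possible to guarantee that $V_n$ is uniformly bounded away from zero (per Proposition 2.1). In this case, to obtain an upper bound on $D_s^\varepsilon$, we employ Chebyshev's inequality instead of the Berry-Esseen theorem. In the following proposition, which is usually good enough to establish strong converses, we do not assume that the component distributions are the same.

Proposition 2.2 (Chebyshev bound for $D_s^\varepsilon$). We have

\[ D_s^\varepsilon(P^{(n)}\|Q^{(n)}) \le n D_n + \sqrt{\frac{n V_n}{1 - \varepsilon}}. \tag{2.39} \]

Proof. By the definition of the $\varepsilon$-information spectrum divergence, we have

\[ D_s^\varepsilon(P^{(n)}\|Q^{(n)}) = \max\{ D^-, D^+ \}, \tag{2.40} \]

where $D^-$ and $D^+$ are defined as

\begin{align}
D^- &:= \sup\bigg\{ R \le n D_n : P^{(n)}\Big( \Big\{ \mathbf{z} : \log \frac{P^{(n)}(\mathbf{z})}{Q^{(n)}(\mathbf{z})} \le R \Big\} \Big) \le \varepsilon \bigg\}, \tag{2.41} \\
D^+ &:= \sup\bigg\{ R > n D_n : P^{(n)}\Big( \Big\{ \mathbf{z} : \log \frac{P^{(n)}(\mathbf{z})}{Q^{(n)}(\mathbf{z})} \le R \Big\} \Big) \le \varepsilon \bigg\}. \tag{2.42}
\end{align}

Clearly, $D^- \le n D_n$, so it remains to upper bound $D^+$. Let $R > n D_n$ be fixed. By Chebyshev's inequality,

\[ P^{(n)}\Big( \Big\{ \mathbf{z} : \log \frac{P^{(n)}(\mathbf{z})}{Q^{(n)}(\mathbf{z})} \le R \Big\} \Big) \ge 1 - \frac{n V_n}{(R - n D_n)^2}. \tag{2.43} \]

Hence, we have

\begin{align}
D^+ &\le \sup\bigg\{ R > n D_n : 1 - \frac{n V_n}{(R - n D_n)^2} \le \varepsilon \bigg\} \tag{2.44} \\
&= n D_n + \sqrt{\frac{n V_n}{1 - \varepsilon}}. \tag{2.45}
\end{align}

Thus, we see that the bound on $D^+$ dominates. This yields (2.39) as desired. □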


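Now we would like an expansion for $D_h^\varepsilon$ similar to that for $D_s^\varepsilon$ in Corollary 2.1. The following was shown by Strassen [152, Thm. 3.1].

Proposition 2.3 (Asymptotics of $D_h^\varepsilon$). Assume the conditions in Corollary 2.1. The following holds:

\[ D_h^\varepsilon(P^n\|Q^n) = n D(P\|Q) + \sqrt{n V(P\|Q)}\, \Phi^{-1}(\varepsilon) + \frac{1}{2} \log n + O(1). \tag{2.46} \]

As a result, in the asymptotic setting for identical product distributions, $D_h^\varepsilon(P^n\|Q^n)$ exceeds $D_s^\varepsilon(P^n\|Q^n)$ by $\frac{1}{2} \log n$ ignoring constant terms, i.e.,

\[ D_h^\varepsilon(P^n\|Q^n) = D_s^\varepsilon(P^n\|Q^n) + \frac{1}{2} \log n + O(1). \tag{2.47} \]

Proof. Let us first verify the upper bound. Let $\eta$ in the upper bound of Lemma 2.4 be chosen to be $\frac{1}{\sqrt{n}}$. Now, for $n$ large enough (so $\frac{1}{\sqrt{n}} < 1 - \varepsilon$), combine this upper bound with Corollary 2.1 to obtain that

\begin{align}
D_h^\varepsilon(P^n\|Q^n) &\le D_s^{\varepsilon + \frac{1}{\sqrt{n}}}(P^n\|Q^n) + \frac{1}{2} \log n + \log(1 - \varepsilon) \tag{2.48} \\
&= n D(P\|Q) + \sqrt{n V(P\|Q)}\, \Phi^{-1}\bigg( \varepsilon + \frac{1}{\sqrt{n}} \bigg) + \frac{1}{2} \log n + O(1). \tag{2.49}
\end{align}

Applying a Taylor expansion to the last step and noting that $V(P\|Q) < \infty$ because $P \ll Q$ yields the upper bound in (2.46).

The proof of the lower bound in (2.46) is slightly more involved. Observe that if we naively employed (2.13) to lower bound $D_h^\varepsilon(P^n\|Q^n)$ with $D_s^\varepsilon(P^n\|Q^n) - \log \frac{1}{1-\varepsilon}$, the third-order term would be $O(1)$ instead of the better $\frac{1}{2} \log n + O(1)$. The idea is to propose an appropriate test for $D_h^\varepsilon$ and to use Theorem 1.3. Consider the likelihood ratio test

\[ \delta(\mathbf{z}) := 1\bigg\{ \log \frac{P^n(\mathbf{z})}{Q^n(\mathbf{z})} \le \gamma \bigg\}. \tag{2.50} \]

Define $\sigma^2 := V(P\|Q)$ and $T := T(P\|Q)$. Also define the i.i.d. random variables $U_i := \log P(Z_i) - \log Q(Z_i)$, $1 \le i \le n$, each having variance $\sigma^2$ and third absolute moment $T$. Consider the expectation of $1 - \delta(Z^n)$ under the distribution $Q^n$:

\begin{align}
\mathbb{E}_{Q^n}[1 - \delta(Z^n)] &= \sum_{\mathbf{z}} Q^n(\mathbf{z})\, 1\bigg\{ \log \frac{P^n(\mathbf{z})}{Q^n(\mathbf{z})} > \gamma \bigg\} \tag{2.51} \\
&= \sum_{\mathbf{z}} P^n(\mathbf{z}) \exp\bigg( -\log \frac{P^n(\mathbf{z})}{Q^n(\mathbf{z})} \bigg) 1\bigg\{ \log \frac{P^n(\mathbf{z})}{Q^n(\mathbf{z})} > \gamma \bigg\} \tag{2.52} \\
&= \mathbb{E}_{P^n}\bigg[ \exp\Big( -\sum_{i=1}^n U_i \Big) 1\Big\{ \sum_{i=1}^n U_i > \gamma \Big\} \bigg] \tag{2.53} \\
&\le 2 \bigg( \frac{\log 2}{\sqrt{2\pi}} + \frac{12\, T}{\sigma^2} \bigg) \frac{\exp(-\gamma)}{\sigma \sqrt{n}}, \tag{2.54}
\end{align}

where (2.54) is an application of Theorem 1.3. Now put

\[ \gamma := n D(P\|Q) + \sqrt{n V(P\|Q)}\, \Phi^{-1}\bigg( \varepsilon - \frac{6\, T}{\sqrt{n V(P\|Q)^3}} \bigg). \tag{2.55} \]

An application of the Berry-Esseen theorem yields

\[ \mathbb{E}_{P^n}[\delta(Z^n)] = P^n\bigg( \Big\{ \mathbf{z} : \sum_{i=1}^n \log \frac{P(z_i)}{Q(z_i)} \le \gamma \Big\} \bigg) \le \varepsilon. \tag{2.56} \]

From (2.54), (2.56) and the definition of $D_h^\varepsilon$, we have

\[ D_h^\varepsilon(P^n\|Q^n) \ge \gamma + \log(\sigma \sqrt{n}) + O(1) = \gamma + \frac{1}{2} \log n + O(1). \tag{2.57} \]

The proof is concluded by plugging (2.55) into (2.57) and Taylor expanding $\Phi^{-1}(\cdot)$ around $\varepsilon$. □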


We remark that the lower bound in Proposition 2.3 can be achieved using deterministic tests, i.e., $\delta$ can be chosen to be an indicator function as in (2.50). Randomization is thus unnecessary. Also, one can relax the assumption that $Q^n$ is a product probability measure; it can be an arbitrary product measure. These realizations are important to make the connection between hypothesis testing and lossless source coding, which we discuss in the next chapter.

A corollary of Proposition 2.3 is the Chernoff-Stein lemma [25], quantifying the error exponent of the probability of missed detection keeping the probability of false alarm bounded above by $\varepsilon$.

Corollary 2.2 (Chernoff-Stein lemma). Assume the conditions in Corollary 2.1 and recall the definition of $\beta_{1-\varepsilon}$ in (2.5). For every $\varepsilon \in (0,1)$,

\[ \lim_{n \to \infty} \frac{1}{n} \log \frac{1}{\beta_{1-\varepsilon}(P^n, Q^n)} = \lim_{n \to \infty} \frac{1}{n} D_h^\varepsilon(P^n\|Q^n) = D(P\|Q). \tag{2.58} \]
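The limit in (2.58) can be visualized by dividing the first three terms of (2.46) by $n$. The following sketch does this for a hypothetical pair $P = \mathrm{Bern}(0.3)$, $Q = \mathrm{Bern}(0.5)$; it ignores the $O(1)$ term, so the printed numbers are approximations and not exact values of $\frac{1}{n} D_h^\varepsilon(P^n\|Q^n)$.

```python
import math
from statistics import NormalDist


def d_h_expansion(n, D, V, eps):
    """First three terms of (2.46): n D + sqrt(n V) Phi^{-1}(eps) + (1/2) log n."""
    return n * D + math.sqrt(n * V) * NormalDist().inv_cdf(eps) + 0.5 * math.log(n)


# Hypothetical pair P = Bern(0.3), Q = Bern(0.5); all logarithms are natural.
p, q, eps = 0.3, 0.5, 0.1
a, b = math.log(p / q), math.log((1 - p) / (1 - q))
D = p * a + (1 - p) * b                        # relative entropy D(P||Q)
V = p * (a - D) ** 2 + (1 - p) * (b - D) ** 2  # relative entropy variance V(P||Q)

gaps = []
for n in (10**2, 10**4, 10**6):
    normalized = d_h_expansion(n, D, V, eps) / n
    gaps.append(abs(normalized - D))
    print(n, normalized)   # approaches D as n grows
print(D)
```

The gap to $D(P\|Q)$ shrinks at the rate $1/\sqrt{n}$ dictated by the second-order term, which is the finite-blocklength penalty that the Chernoff-Stein lemma alone does not reveal.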
Part II

Point-To-Point Communication


Chapter 3

Source Coding 


In this chapter, we revisit the fundamental problem of fixed-to-fixed length lossless and lossy source compression. Shannon, in his original paper [141] that launched the field of information theory, showed that the fundamental limit of compression of a discrete memoryless source (DMS) $P$ is the entropy $H(P)$. For the case of continuous sources, lossless compression is not possible and some distortion must be allowed. Shannon showed in [144] that the corresponding fundamental limit of compression of a memoryless source $P$ up to distortion $\Delta \ge 0$, assuming a separable distortion measure $d$, is the rate-distortion function

\[ R(P, \Delta) := \min_{W \in \mathcal{P}(\hat{\mathcal{X}}|\mathcal{X}) \,:\, \mathbb{E}_{P \times W}[d(X, \hat{X})] \le \Delta} I(P, W). \tag{3.1} \]

These first-order fundamental limits are attained as the number of realizations of the source (i.e., the blocklength of the source) $P$ tends to infinity. The strong converse for rate-distortion is also known and shown, for example, in [39, Ch. 7]. In the following, we present known non-asymptotic bounds for lossless and lossy source coding. We then fix the permissible error probability in the lossless case and the excess distortion probability in the lossy case at some non-vanishing $\varepsilon \in (0,1)$. The non-asymptotic bounds are evaluated as $n$ becomes large so as to obtain asymptotic expansions of the logarithm of the smallest achievable code size. These refined results provide an approximation of the extra code rate (beyond $H(P)$ or $R(P, \Delta)$) one must incur when operating in the finite blocklength regime. Finally, for both the lossless and lossy compression problems, we provide alternative proof techniques based on the method of types that are partially universal.

The material in this chapter concerning lossless source coding is based on the seminal work by Strassen [152, Thm. 1.1]. The material on lossy source coding is based on more recent work by Ingber-Kochman [86] and Kostina-Verdú.

3.1 Lossless Source Coding: Non-Asymptotic Bounds 

We now set up the almost lossless source coding problem formally. As mentioned, we only consider fixed-to-fixed length source coding in this monograph. A source is simply a probability mass function $P$ on some finite alphabet $\mathcal{X}$ or the associated random variable $X$ with distribution $P$. See Fig. 3.1 for an illustration of the setup.

An $(M, \varepsilon)$-code for the source $P \in \mathcal{P}(\mathcal{X})$ consists of a pair of maps that includes an encoder $f : \mathcal{X} \to \{1, \ldots, M\}$ and a decoder $\varphi : \{1, \ldots, M\} \to \mathcal{X}$ such that the error probability

\[ P\big( \{ x \in \mathcal{X} : \varphi(f(x)) \ne x \} \big) \le \varepsilon. \tag{3.2} \]

The number $M$ is called the size of the code $(f, \varphi)$.

Given a source $P$, we define the almost lossless source coding non-asymptotic fundamental limit as

\[ M^*(P, \varepsilon) := \min\{ M \in \mathbb{N} : \exists \text{ an } (M, \varepsilon)\text{-code for } P \}. \tag{3.3} \]
Figure 3.1: Illustration of the fixed-to-fixed length source coding problem.


Obviously, for an arbitrary source, the exact evaluation of the minimum code size $M^*(P, \varepsilon)$ is challenging. In the following, we assume that it is a discrete memoryless source (DMS), i.e., the distribution $P^n$ consists of $n$ copies of an underlying distribution $P$. With this assumption, we can find an asymptotic expansion of $\log M^*(P^n, \varepsilon)$.

The agenda for this and subsequent chapters will largely be standard. We first establish “good” bounds on non-asymptotic quantities like $M^*(P, \varepsilon)$. Subsequently, we replace the source or channel with $n$ independent copies of it. Finally, we use an appropriate limit theorem (e.g., those in Section 1.5) to evaluate the non-asymptotic bounds in the large $n$ limit.


3.1.1 An Achievability Bound 

One of the take-home messages that we would like to convey in this section is that fixed-to-fixed length lossless source coding is nothing but binary hypothesis testing where the measures $P$ and $Q$ are chosen appropriately. In fact, a reasonable coding scheme for lossless source coding would simply be to encode a “typical” set of source symbols $\mathcal{T} \subset \mathcal{X}$, ignore the rest, and declare an error if the realized symbol from the source is not in $\mathcal{T}$. In this way, one sees that

\[ M^*(P, \varepsilon) \le \min_{\mathcal{T} \subset \mathcal{X} \,:\, P(\mathcal{X} \setminus \mathcal{T}) \le \varepsilon} |\mathcal{T}|. \tag{3.4} \]

This bound can be stated in terms of $\beta_{1-\varepsilon}(P, Q)$ or, equivalently, the $\varepsilon$-hypothesis testing divergence $D_h^\varepsilon(P\|Q)$ if we restrict the tests that define these quantities to be deterministic, and also allow $Q$ to be an arbitrary measure (not necessarily a probability measure). Let $\mu$ be the counting measure, i.e.,

\[ \mu(\mathcal{A}) := |\mathcal{A}|, \quad \forall\, \mathcal{A} \subset \mathcal{X}. \tag{3.5} \]

Proposition 3.1 (Source coding as hypothesis testing: Achievability). Let $\varepsilon \in (0,1)$ and let $P$ be any source with countable alphabet $\mathcal{X}$. We have

\[ M^*(P, \varepsilon) \le \beta_{1-\varepsilon}(P, \mu), \tag{3.6} \]

or in terms of the $\varepsilon$-hypothesis testing divergence (cf. (2.6)),

\[ \log M^*(P, \varepsilon) \le -D_h^\varepsilon(P\|\mu) - \log \frac{1}{1-\varepsilon}. \tag{3.7} \]
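The right-hand side of (3.4) is easy to evaluate on a finite alphabet: keep the most probable symbols until the mass left outside $\mathcal{T}$ is at most $\varepsilon$. The Python sketch below does exactly this; the function name `min_code_size` and the example source are hypothetical.

```python
def min_code_size(P, eps):
    """Right-hand side of (3.4): the smallest |T| such that the mass outside T is <= eps.

    Greedy choice: keep the most probable symbols until the probability
    mass left outside T is at most eps.
    """
    missing = 1.0   # probability mass not yet covered by T
    size = 0
    for p in sorted(P, reverse=True):
        if missing <= eps:
            break
        missing -= p
        size += 1
    return size


P = [0.5, 0.25, 0.125, 0.125]
print(min_code_size(P, 0.2))    # 3
print(min_code_size(P, 0.25))   # 2
```

In the language of (3.6), the set $\mathcal{T}$ corresponds to a deterministic test that accepts the source hypothesis on $\mathcal{T}$, and $|\mathcal{T}| = \mu(\mathcal{T})$ plays the role of the type-II "error" under the counting measure.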


3.1.2 A Converse Bound 

The converse bound we evaluate is also intimately connected to a divergence we introduced in the previous chapter, namely the $\varepsilon$-information spectrum divergence, where the distribution in the alternate hypothesis $Q$ is chosen to be the counting measure.

Proposition 3.2 (Source coding as hypothesis testing: Converse). Let $\varepsilon \in (0,1)$ and let $P$ be any source with countable alphabet $\mathcal{X}$. For any $\eta \in (0, 1-\varepsilon)$, we have

\[ \log M^*(P, \varepsilon) \ge -D_s^{\varepsilon+\eta}(P\|\mu) - \log \frac{1}{\eta}. \tag{3.8} \]

This statement is exactly Lemma 1.3.2 in Han's book. Since the proof is short, we provide it for completeness.

Proof. By the definition of the $\varepsilon$-information spectrum divergence, it is enough to establish that every $(M, \varepsilon)$-code for $P$ must satisfy

\[ \varepsilon + \eta \ge P\Big( \Big\{ x : P(x) \le \frac{\eta}{M} \Big\} \Big) \tag{3.9} \]

for any $\eta \in (0, 1-\varepsilon)$. Let $\mathcal{T} := \{ x : P(x) \le \frac{\eta}{M} \}$ and let $\mathcal{S} := \{ x : \varphi(f(x)) = x \}$. Clearly, $|\mathcal{S}| \le M$. We have

\[ P(\mathcal{T}) \le P(\mathcal{X} \setminus \mathcal{S}) + P(\mathcal{T} \cap \mathcal{S}) \le \varepsilon + P(\mathcal{T} \cap \mathcal{S}). \tag{3.10} \]

Furthermore,

\[ P(\mathcal{T} \cap \mathcal{S}) = \sum_{x \in \mathcal{T} \cap \mathcal{S}} P(x) \le \sum_{x \in \mathcal{T} \cap \mathcal{S}} \frac{\eta}{M} \le \eta. \tag{3.11} \]

Uniting (3.10) and (3.11) gives (3.9) as desired. □

Observe the similarity of this proof to the proof of the upper bound of $D_h^\varepsilon$ in terms of $D_s^{\varepsilon+\eta}$ in Lemma 2.4.


3.2 Lossless Source Coding: Asymptotic Expansions 


Now we assume that the source $P^n$ is stationary and memoryless, i.e., a DMS. More precisely,

\[ P^n(\mathbf{x}) = \prod_{i=1}^n P(x_i), \quad \forall\, \mathbf{x} \in \mathcal{X}^n. \tag{3.12} \]

We assume throughout that $P(x) > 0$ for all $x \in \mathcal{X}$. Shannon [141] showed that the minimum rate to achieve almost lossless compression of a DMS $P$ is the entropy $H(P)$. In this section as well as the next one, we derive finer evaluations of the fundamental compression limit by considering the asymptotic expansion of $\log M^*(P^n, \varepsilon)$. To do so, we need to define another important quantity related to the source $P$.

Let the source dispersion of $P$ be the variance of the self-information random variable $-\log P(X)$, i.e.,

\[ V(P) := \mathrm{Var}\bigg[ \log \frac{1}{P(X)} \bigg] = \sum_{x \in \mathcal{X}} P(x) \bigg[ \log \frac{1}{P(x)} - H(P) \bigg]^2. \tag{3.13} \]

Note that the expectation of the self-information is the entropy $H(P)$. In Kontoyiannis-Verdú [95], $V(P)$ is called the varentropy. If $V(P) = 0$, this means that the source is either deterministic or uniform.

The two non-asymptotic theorems in the preceding section combine to give the following asymptotic expansion of the minimum code size $M^*(P^n, \varepsilon)$.

Theorem 3.1. If the source $P \in \mathcal{P}(\mathcal{X})$ satisfies $V(P) > 0$, then

\[ \log M^*(P^n, \varepsilon) = n H(P) - \sqrt{n V(P)}\, \Phi^{-1}(\varepsilon) - \frac{1}{2} \log n + O(1). \tag{3.14} \]

Otherwise, we have

\[ \log M^*(P^n, \varepsilon) = n H(P) + O(1). \tag{3.15} \]

Proof. For the direct part of (3.14) (upper bound), note that the term $-\log \frac{1}{1-\varepsilon}$ in Proposition 3.1 is a constant, so we simply have to evaluate $D_h^\varepsilon(P^n\|\mu^n)$.¹ From Proposition 2.3 and its remark that the lower bound on $D_h^\varepsilon(P^n\|Q^n)$ can be achieved using deterministic tests, we have

\[ D_h^\varepsilon(P^n\|\mu^n) = n D(P\|\mu) + \sqrt{n V(P\|\mu)}\, \Phi^{-1}(\varepsilon) + \frac{1}{2} \log n + O(1). \tag{3.16} \]

¹ Just to be pedantic, for any $\mathcal{A} \subset \mathcal{X}^n$, the measure $\mu^n(\mathcal{A})$ is defined as $\sum_{\mathbf{x} \in \mathcal{A}} \mu^n(\mathbf{x}) = |\mathcal{A}|$, since $\mu^n(\mathbf{x}) = 1$ for each $\mathbf{x} \in \mathcal{X}^n$. Hence, $\mu^n$ has the required product structure for the application of Proposition 2.3, for which the second argument of $D_h^\varepsilon$ is not restricted to product probability measures.

It can easily be verified (cf. (1.19)) that

\[ D(P\|\mu) = -H(P), \quad\text{and}\quad V(P\|\mu) = V(P). \tag{3.17} \]

This concludes the proof of the direct part in light of Proposition 3.1.

For the converse part of (3.14) (lower bound), choose $\eta = \frac{1}{\sqrt{n}}$ so the term $-\log \frac{1}{\eta}$ in Proposition 3.2 gives us the $-\frac{1}{2} \log n$ term. Furthermore, by Proposition 2.1 and the simplifications in (3.17),

\[ D_s^{\varepsilon + \frac{1}{\sqrt{n}}}(P^n\|\mu^n) = -n H(P) + \sqrt{n V(P)}\, \Phi^{-1}\bigg( \varepsilon + \frac{1}{\sqrt{n}} \bigg) + O(1). \tag{3.18} \]

A Taylor expansion of $\Phi^{-1}(\cdot)$ completes the proof of (3.14).

For (3.15), note that the self-information $-\log P(X)$ takes on the value $H(P)$ with probability one. In other words,

\[ P^n\big( \log P^n(X^n) \le R \big) = 1\{ R \ge -n H(P) \}. \tag{3.19} \]

The bounds on $\log M^*(P^n, \varepsilon)$ in Propositions 3.1 and 3.2 and the relaxation to the $\varepsilon$-information spectrum divergence (Lemma 2.4) yield (3.15). □
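Dropping the $O(1)$ term in (3.14) gives a Gaussian approximation of the minimum code size that is simple to evaluate. The sketch below computes $H(P)$, $V(P)$ and this approximation in nats; the function names and the binary source $P = (0.11, 0.89)$ are hypothetical, and the formula is meant only for sources with $V(P) > 0$.

```python
import math
from statistics import NormalDist


def entropy_and_varentropy(P):
    """H(P) and V(P) of (3.13), in nats."""
    H = -sum(p * math.log(p) for p in P if p > 0)
    V = sum(p * (-math.log(p) - H) ** 2 for p in P if p > 0)
    return H, V


def log_code_size_approx(P, n, eps):
    """n H(P) - sqrt(n V(P)) Phi^{-1}(eps) - (1/2) log n: (3.14) without the O(1) term."""
    H, V = entropy_and_varentropy(P)
    return n * H - math.sqrt(n * V) * NormalDist().inv_cdf(eps) - 0.5 * math.log(n)


P = [0.11, 0.89]    # hypothetical binary memoryless source
n, eps = 1000, 0.1
H, V = entropy_and_varentropy(P)
rate = log_code_size_approx(P, n, eps) / n   # nats per source symbol
print(H, rate)
```

For $\varepsilon < \frac{1}{2}$ the term $-\sqrt{nV(P)}\,\Phi^{-1}(\varepsilon)$ is positive, so at moderate blocklengths the approximate rate sits above the entropy; this is the extra code rate referred to at the start of the chapter.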


The expansion in (3.14) in Theorem 3.1 appeared in early works by Yushkevich [190] (with $o(\sqrt{n})$ in place of $-\frac{1}{2} \log n$ but for Markov chains) and Strassen [152, Thm. 1.1] (in the form stated). It has since appeared in various other forms and levels of generality in Kontoyiannis [93], Hayashi, Kostina-Verdú, Nomura-Han and Kontoyiannis-Verdú [95], among others.

As can be seen from the non-asymptotic bounds and the asymptotic evaluation, fixed-to-fixed length lossless source coding and binary hypothesis testing are virtually the same problem. Asymptotic expansions for $D_h^\varepsilon$ and $D_s^\varepsilon$ can be used directly to estimate the minimum code size $M^*(P^n, \varepsilon)$ for an $\varepsilon$-reliable lossless source code.


3.3 Second-Order Asymptotics of Lossless Source Coding via the Method of Types


Clearly, the coding scheme described in (3.4) is non-universal, i.e., the code depends on knowledge of the source distribution. In many applications, the exact source distribution is unknown, and hence has to be estimated a priori, or one has to design a source code that works well for any source distribution. It is a well-known application of the method of types that universal source codes achieve the lossless source coding error exponent [39, Thm. 2.15]. One then wonders whether there is any degradation in the asymptotic expansion of $\log M^*(P^n, \varepsilon)$ if the encoder and decoder know less about the source. It turns out that the source dispersion term can be achieved with only the knowledge of $H(P)$ and $V(P)$. However, one has to work much harder to determine the third-order term. For conclusive results on the third-order term for fixed-to-variable length source coding, the reader is referred to the elegant work by Kosut and Sankar [100, 101]. The technique outlined in this section involves the method of types.

Let $M^*(P, \varepsilon)$ be the almost lossless source coding non-asymptotic fundamental limit where the source code $(f, \varphi)$ is ignorant of the probability distribution of the source $P$, except for the entropy $H(P)$ and the varentropy $V(P)$.

Theorem 3.2. If the source $P \in \mathcal{P}(\mathcal{X})$ satisfies $V(P) > 0$, then

\[ \log M^*(P^n, \varepsilon) \le n H(P) - \sqrt{n V(P)}\, \Phi^{-1}(\varepsilon) + (|\mathcal{X}| - 1) \log n + O(1). \tag{3.20} \]

The proof we present here results in a third-order term that is likely to be far from optimal, but we present this proof to demonstrate the similarity to the classical proof of the fixed-to-fixed length source coding error exponent using the method of types [39, Thm. 2.15].
Proof of Theorem 3.2. Let $\mathcal{X} = \{a_1, \ldots, a_d\}$ without loss of generality. Set $M$, the size of the code, to be the smallest integer exceeding

\[ \exp\bigg( n H(P) - \sqrt{n V(P)}\, \Phi^{-1}\Big( \varepsilon - \frac{c}{\sqrt{n}} \Big) + (d-1) \log(n+1) \bigg), \tag{3.21} \]

for some finite constant $c > 0$ (given in Theorem 1.5). Let $\mathcal{K}$ be the set of sequences in $\mathcal{X}^n$ whose empirical entropy is no larger than

\[ \gamma := \frac{1}{n} \log M - \frac{(d-1) \log(n+1)}{n}. \tag{3.22} \]

In other words,

\[ \mathcal{K} := \bigcup_{Q \in \mathcal{P}_n(\mathcal{X}) \,:\, H(Q) \le \gamma} \mathcal{T}_Q. \tag{3.23} \]

Encode all sequences in $\mathcal{K}$ in a one-to-one way and sequences not in $\mathcal{K}$ arbitrarily. By the type counting lemma in (1.27) and Lemma 1.2 (size of type class), we have

\[ |\mathcal{K}| \le \sum_{Q \in \mathcal{P}_n(\mathcal{X}) \,:\, H(Q) \le \gamma} \exp\big( n H(Q) \big) \le (n+1)^{d-1} \exp(n \gamma) \le M, \tag{3.24} \]

so the number of sequences that can be reconstructed without error is at most $M$ as required. An error occurs if and only if the source sequence has empirical entropy exceeding $\gamma$, i.e., the error probability is

\[ p := \Pr\big( H(P_{X^n}) > \gamma \big), \tag{3.25} \]

where $P_{X^n} \in \mathcal{P}_n(\mathcal{X})$ is the random type of $X^n \in \mathcal{X}^n$. This probability can be written as

\[ p = \Pr\big( f(P_{X^n} - P) > \gamma \big), \tag{3.26} \]

where the function $f : \mathbb{R}^d \to \mathbb{R}$ is defined as

\[ f(\mathbf{v}) = H(\mathbf{v} + P). \tag{3.27} \]

In (3.26) and (3.27), we regarded the type $P_{X^n}$ and the true distribution $P$ as vectors of length $d = |\mathcal{X}|$, and $H(\mathbf{w}) = -\sum_j w_j \log w_j$ is the entropy. Note that the argument of $f(\cdot)$ in (3.26) can be written as

\[ P_{X^n} - P = \frac{1}{n} \sum_{i=1}^n \begin{bmatrix} 1\{X_i = a_1\} - P(a_1) \\ \vdots \\ 1\{X_i = a_d\} - P(a_d) \end{bmatrix}. \tag{3.28} \]

Since $U_i^d := [1\{X_i = a_1\} - P(a_1), \ldots, 1\{X_i = a_d\} - P(a_d)]'$ for $i = 1, \ldots, n$ are zero-mean, i.i.d. random vectors, we may appeal to the Berry-Esseen theorem for functions of i.i.d. random vectors in Theorem 1.5. Indeed, we note that $f(\mathbf{0}) = H(P)$, the Jacobian of $f$ evaluated at $\mathbf{v} = \mathbf{0}$ is

\[ J = \bigg[ \log \frac{1}{e P(a_1)} \;\; \ldots \;\; \log \frac{1}{e P(a_d)} \bigg], \tag{3.29} \]

and the $(s,t) \in \{1, \ldots, d\}^2$ element of the covariance matrix of $U_1^d$ is

\[ \big[ \mathrm{Cov}(U_1^d) \big]_{st} = \begin{cases} P(a_s)(1 - P(a_s)) & s = t, \\ -P(a_s) P(a_t) & s \ne t. \end{cases} \tag{3.30} \]

As such, by a routine multiplication of matrices,

\[ J\, \mathrm{Cov}(U_1^d)\, J' = V(P), \tag{3.31} \]

the varentropy of the source. We deduce from Theorem 1.5 that

\[ p \le \Phi\bigg( \frac{H(P) - \gamma}{\sqrt{V(P)/n}} \bigg) + \frac{c}{\sqrt{n}}, \tag{3.32} \]

where $c$ is a finite positive constant (depending on $P$). By the choice of $M$ and $\gamma$ in (3.21)-(3.22), we see that $p$ is no larger than $\varepsilon$. □
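The "routine multiplication of matrices" behind (3.31) can be checked numerically. The following Python sketch builds $J$ from (3.29) and the covariance matrix from (3.30) and compares the quadratic form with the varentropy; the function names and the three-letter source are hypothetical, and natural logarithms are used (so $\log e = 1$).

```python
import math


def varentropy(P):
    """V(P) of (3.13), in nats, for a source with full support."""
    H = -sum(p * math.log(p) for p in P)
    return sum(p * (-math.log(p) - H) ** 2 for p in P)


def delta_method_variance(P):
    """J Cov(U_1) J' with J from (3.29) and Cov(U_1) from (3.30)."""
    d = len(P)
    J = [math.log(1.0 / (math.e * p)) for p in P]   # gradient of H(v + P) at v = 0
    cov = [[P[s] * (1 - P[s]) if s == t else -P[s] * P[t] for t in range(d)]
           for s in range(d)]
    return sum(J[s] * cov[s][t] * J[t] for s in range(d) for t in range(d))


P = [0.5, 0.3, 0.2]   # hypothetical source with full support
print(delta_method_variance(P), varentropy(P))   # the two values agree up to rounding
```

The agreement reflects the identity $J\,\mathrm{Cov}(U_1^d)\,J' = \sum_s J_s^2 P(a_s) - (\sum_s J_s P(a_s))^2$, which is the variance of $-\log P(X) - 1$ and hence equals $V(P)$.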

This coding scheme is partially universal in the sense that $H(P)$ and $V(P)$ must be known in order to
determine $M$ and the threshold $\gamma$ in (3.22), but no other characteristic of the source $P$
is required. This achievability proof strategy is rather general and can be applied to rate-distortion
(cf. Section 3.6), channel coding, joint source-channel coding [170], as well as multi-terminal
problems [157] (cf. Section 6.3).

The point we would like to emphasize in this section is the following: In large deviations (error exponent)
analysis of almost lossless source coding, the probability of error in (3.25) is evaluated using, for example,
Sanov’s theorem [3^1 Ex. 2.12], or refined versions of it [331 Ex. 2.7(c)]. In the above proof, the probability 
of error is instead estimated using the Berry-Esseen theorem (Theorem 1.5) since the deviation of the code
rate from the first-order fundamental limit $H(P)$ is of the order $\Theta(1/\sqrt{n})$ instead of a constant. Essentially,
the proof of Theorem 3.2 hinges on the fact that for a random vector $X^n$ with distribution $P^n$, the entropy
of the type $P_{X^n}$, namely the empirical entropy, satisfies the following central limit relation:

$$\sqrt{n}\,\big( H(P_{X^n}) - H(P) \big) \xrightarrow{d} \mathcal{N}(0, V(P)). \tag{3.33}$$

Finally, we note that the technique to bound the probability in (3.26) is similar to that suggested by Kosut
and Sankar [101, Lem. 1].
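
To see the central limit relation (3.33) at work, the error probability in (3.25) can be computed exactly for a binary source and compared with its Gaussian estimate. This is only an illustrative sketch: the Bernoulli parameter, the blocklength, and the choice of a threshold one standard deviation above $H(P)$ are assumptions made here, not values from the text.

```python
import math
from statistics import NormalDist

def binary_entropy(q):
    """Entropy in nats of a Bernoulli(q) distribution."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

def exact_error_probability(n, p, gamma):
    """Pr(H(P_{X^n}) > gamma) for an i.i.d. Bernoulli(p) source, summed over types k/n."""
    total = 0.0
    for k in range(n + 1):
        if binary_entropy(k / n) > gamma:
            # Binomial pmf evaluated in the log domain to avoid overflow.
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                       + k * math.log(p) + (n - k) * math.log(1 - p))
            total += math.exp(log_pmf)
    return total

n, p = 2000, 0.11                                   # assumed for illustration
H = binary_entropy(p)
V = p * (1 - p) * math.log((1 - p) / p) ** 2        # varentropy of Bernoulli(p)
gamma = H + math.sqrt(V / n)                        # threshold one std. dev. above H(P)
exact = exact_error_probability(n, p, gamma)
gaussian = NormalDist().cdf((H - gamma) / math.sqrt(V / n))
```

The gap between `exact` and `gaussian` is of the order $1/\sqrt{n}$, which is the size of the correction term that the Berry-Esseen theorem allows for.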


3.4 Lossy Source Coding: Non-Asymptotic Bounds 

In the second half of this chapter, we consider the lossy source coding problem where the source $P \in \mathscr{P}(\mathcal{X})$
does not have to be discrete. The setup is as in Fig. 3.1 and the reconstruction alphabet (which need not be
the same as $\mathcal{X}$) is denoted as $\hat{\mathcal{X}}$. For the lossy case, one considers a distortion measure $d(x,\hat{x})$ between the
source $x \in \mathcal{X}$ and its reconstruction $\hat{x} \in \hat{\mathcal{X}}$. This is simply a mapping from $\mathcal{X} \times \hat{\mathcal{X}}$ to the set of non-negative
real numbers.

We make the following simplifying assumptions throughout. 


(i) There exists a $\Delta$ such that $R(P, \Delta)$, defined in (3.1), is finite.

(ii) The distortion measure is such that there exists a finite set $\mathcal{E} \subset \hat{\mathcal{X}}$ such that $\mathbb{E}[\min_{\hat{x} \in \mathcal{E}} d(X, \hat{x})]$ is finite.

(iii) For every $x \in \mathcal{X}$, there exists an $\hat{x} \in \hat{\mathcal{X}}$ such that $d(x, \hat{x}) = 0$.

(iv) The source $P$ and the distortion $d$ are such that the minimizing test channel $W$ in the rate-distortion
function in (3.1) is unique and we denote it as $W^*$.


These assumptions are not overly restrictive. Indeed, the most common distortion measures and sources, 
such as finite alphabet sources with the Hamming distortion $d(x,\hat{x}) = 1\{x \ne \hat{x}\}$ and Gaussian sources with
quadratic distortion $d(x,\hat{x}) = (x - \hat{x})^2$, satisfy these assumptions.

An $(M, \Delta, d, \varepsilon)$-code for the source $P \in \mathscr{P}(\mathcal{X})$ consists of an encoder $f : \mathcal{X} \to \{1, \ldots, M\}$ and a decoder
$\varphi : \{1, \ldots, M\} \to \hat{\mathcal{X}}$ such that the probability of excess distortion satisfies

$$P\big(\{x \in \mathcal{X} : d(x, \varphi(f(x))) > \Delta\}\big) \le \varepsilon. \tag{3.34}$$

The number $M$ is called the size of the code $(f, \varphi)$.

Given a source $P$, define the lossy source coding non-asymptotic fundamental limit as

$$M^*(P, \Delta, d, \varepsilon) := \min\{M \in \mathbb{N} : \exists \text{ an } (M, \Delta, d, \varepsilon)\text{-code for } P\}. \tag{3.35}$$

In the following subsections, we present a non-asymptotic achievability bound and a corresponding converse 
bound, both of which we evaluate asymptotically in the next section. 


3.4.1 An Achievability Bound 

The non-asymptotic achievability bound is based on Shannon's random coding argument, and is due to
Kostina-Verdú [97, Thm. 9]. The encoder is similar to the familiar joint typicality encoder [IHl Ch. 2] with
typicality defined in terms of the distortion measure. To state the bound compactly, define the $\Delta$-distortion
ball around $x$ as

$$B_\Delta(x) := \{\hat{x} \in \hat{\mathcal{X}} : d(x,\hat{x}) \le \Delta\}. \tag{3.36}$$

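Proposition 3.3 (Random Coding Bound). There exists an $(M, \Delta, d, \varepsilon)$-code satisfying

$$\varepsilon \le \inf_{P_{\hat{X}}} \mathbb{E}_X\Big[ \exp\big( -M\, P_{\hat{X}}(B_\Delta(X)) \big) \Big]. \tag{3.37}$$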


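Proof. We use a random coding argument. Fix $P_{\hat{X}} \in \mathscr{P}(\hat{\mathcal{X}})$. Generate $M$ codewords $\hat{x}(m)$, $m = 1, \ldots, M$,
independently according to $P_{\hat{X}}$. The encoder finds an arbitrary $\hat{m}$ satisfying

$$\hat{m} \in \arg\min_{m} d(x, \hat{x}(m)). \tag{3.38}$$

The excess distortion probability, averaged over the random codebook, can then be written as

$$\begin{aligned}
\Pr\big(d(X,\hat{X}) > \Delta\big) &= \mathbb{E}_X\big[ \Pr\big(d(X,\hat{X}) > \Delta \,\big|\, X\big) \big] && (3.39)\\
&= \mathbb{E}_X\Big[ \prod_{m=1}^{M} \Pr\big(d(X,\hat{X}(m)) > \Delta \,\big|\, X\big) \Big] && (3.40)\\
&= \mathbb{E}_X\Big[ \prod_{m=1}^{M} \big(1 - P_{\hat{X}}(B_\Delta(X))\big) \Big] && (3.41)\\
&= \mathbb{E}_X\Big[ \big(1 - P_{\hat{X}}(B_\Delta(X))\big)^{M} \Big]. && (3.42)
\end{aligned}$$

Applying the inequality $(1 - x)^M \le e^{-Mx}$ and minimizing over all possible choices of $P_{\hat{X}}$ completes the
proof. □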
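
The last step of the proof can be illustrated on a toy example. The sketch below is a hypothetical single-letter instance (a three-symbol source, Hamming distortion with $\Delta = 0$, and a codeword distribution equal to the source distribution are all assumptions made for illustration). It evaluates both the exact random-coding expression with $(1-x)^M$ and its exponential relaxation with $e^{-Mx}$.

```python
import math

def ball_mass(x, q_hat, dist, delta):
    """Probability under q_hat of the distortion ball {xhat : d(x, xhat) <= delta}."""
    return sum(q for xhat, q in enumerate(q_hat) if dist[x][xhat] <= delta)

def random_coding_bounds(p, q_hat, dist, delta, M):
    """Return (E[(1 - mass)^M], E[exp(-M * mass)]) for source p and codeword law q_hat."""
    exact = sum(px * (1.0 - ball_mass(x, q_hat, dist, delta)) ** M
                for x, px in enumerate(p))
    relaxed = sum(px * math.exp(-M * ball_mass(x, q_hat, dist, delta))
                  for x, px in enumerate(p))
    return exact, relaxed

p = [0.7, 0.2, 0.1]                  # hypothetical source distribution
q_hat = [0.7, 0.2, 0.1]              # hypothetical codeword distribution
hamming = [[0 if x == y else 1 for y in range(3)] for x in range(3)]
exact, relaxed = random_coding_bounds(p, q_hat, hamming, delta=0, M=8)
```

The relaxed value is never smaller than the exact one, which is why the exponential form is a valid (slightly looser) upper bound on the excess distortion probability.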

3.4.2 A Converse Bound 

In order to state the converse bound, we need to introduce a quantity that is fundamental to rate-distortion
theory. For discrete random variables with the Hamming distortion measure ($d(x,\hat{x}) = 1\{x \ne \hat{x}\}$), it
coincides with the self-information random variable, which, as we have seen in Section 3.2, plays a key role
in the asymptotic expansion of $\log M^*(P^n, \varepsilon)$.

The $\Delta$-tilted information of $x$ [Ml HZ] for a given distortion measure $d$ (whose dependence is suppressed)
is defined as

$$j(x; P, \Delta) := -\log \mathbb{E}_{\hat{X}^*}\Big[ \exp\big( \lambda^*\Delta - \lambda^* d(x, \hat{X}^*) \big) \Big] \tag{3.43}$$

where $\hat{X}^*$ is distributed as $PW^*$ and

$$\lambda^* := -\left.\frac{\partial R(P, \Delta')}{\partial \Delta'}\right|_{\Delta'=\Delta}. \tag{3.44}$$
The differentiability of the rate-distortion function with respect to $\Delta$ is guaranteed by the assumptions in
Section 3.4. The term $\Delta$-tilted information was introduced by Kostina and Verdú [97].

Example 3.1. Consider the Gaussian source $X \sim P(x) = \mathcal{N}(x; 0, \sigma^2)$ with squared-error distortion measure
$d(x,\hat{x}) = (x - \hat{x})^2$. For this problem, simple calculations reveal that

$$j(x; P, \Delta) = \frac{1}{2}\log\frac{\sigma^2}{\Delta} + \left(\frac{x^2}{\sigma^2} - 1\right)\frac{\log e}{2} \tag{3.45}$$

if $\Delta < \sigma^2$, and 0 otherwise.

One important property of the $\Delta$-tilted information of $x$ is that the expectation is exactly equal to the
rate-distortion function, i.e.,

$$R(P,\Delta) = \mathbb{E}_X\big[ j(X;P,\Delta) \big]. \tag{3.46}$$

For the Gaussian source with quadratic distortion, the equality above is easy to verify from Example 3.1.

In view of the asymptotic expansion of lossless source coding in Theorem 3.1, we may expect that the
variance of $j(X;P,\Delta)$ characterizes the second-order asymptotics of rate-distortion. This is indeed so, as we
will see in the following. Other properties of the $\Delta$-tilted information are summarized in [34, Lem. 1.4] and
EZl Properties 1 & 2].

Equipped with the definition of the $\Delta$-tilted information, we are now ready to state the non-asymptotic
converse bound that will turn out to be amenable to asymptotic analyses. This elegant bound was proved
by Kostina-Verdú [97, Thm. 7].

Proposition 3.4 (Converse Bound for Lossy Compression). Fix $\gamma > 0$. Every $(M, \Delta, d, \varepsilon)$-code must satisfy

$$\varepsilon \ge \Pr\big( j(X;P,\Delta) \ge \log M + \gamma \big) - \exp(-\gamma). \tag{3.47}$$

Observe that this is a generalization of Proposition 3.2 for the lossless case. In particular, it generalizes
the bound in (3.9). It is also remarkably similar to the Verdú-Han information spectrum converse bound [169,
Lem. 4] for channel coding (reviewed in (4.10) in Section 4.1.2). This is unsurprising, as channel coding and
rate-distortion are duals in many ways. We refer the reader to [97, Thm. 7] for the proof of Proposition 3.4.


3.5 Lossy Source Coding: Asymptotic Expansions 

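As mentioned in the introduction of this chapter, the first-order fundamental limit for lossy source coding
of stationary and memoryless sources $P^n$ is the rate-distortion function $R(P, \Delta)$. We are interested in finer
approximations of the non-asymptotic fundamental limit $M^*(P^n, \Delta, d^{(n)}, \varepsilon)$ where $P^n$ is the distribution of
a stationary, memoryless source $X^n$ and the distortion measure $d^{(n)} : \mathcal{X}^n \times \hat{\mathcal{X}}^n \to [0, \infty)$ is separable, i.e.,

$$d^{(n)}(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i) \tag{3.48}$$

for any $(\mathbf{x}, \hat{\mathbf{x}}) \in \mathcal{X}^n \times \hat{\mathcal{X}}^n$.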

Let the variance of the $\Delta$-tilted information of $X$ be termed the rate-dispersion function

$$V(P,\Delta) := \mathrm{Var}\big( j(X;P,\Delta) \big). \tag{3.49}$$

Example 3.2. Let us revisit the Gaussian source with quadratic distortion in Example 3.1. It is easy to
verify that the variance of $j(X;P,\Delta)$ is

$$V(P,\Delta) = \frac{\log^2 e}{2} \tag{3.50}$$

if $\Delta < \sigma^2$, and 0 otherwise. Hence, interestingly, the rate-dispersion function for the Gaussian source with
quadratic distortion depends neither on the source variance $\sigma^2$ nor the distortion $\Delta$ if $\Delta < \sigma^2$. This is
peculiar to the Gaussian source with quadratic distortion.

Theorem 3.3. If $P$ and $d$ satisfy the assumptions in Section 3.4 and, in addition, $V(P,\Delta) > 0$ and
$\mathbb{E}_{P \times PW^*}[d(X,\hat{X}^*)^9] < \infty$, then

$$\log M^*(P^n, \Delta, d^{(n)}, \varepsilon) = nR(P,\Delta) - \sqrt{nV(P,\Delta)}\,\Phi^{-1}(\varepsilon) + O(\log n). \tag{3.51}$$

For the case of zero rate-dispersion function $V(P,\Delta) = 0$, the reader is referred to [97, Thm. 12]. The
condition $\mathbb{E}_{P \times PW^*}[d(X,\hat{X}^*)^9] < \infty$ is a technical one, made to ensure that the third absolute moment of
$j(X;P,\Delta)$ is finite for the applicability of the Berry-Esseen theorem.
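
For the Gaussian source of Examples 3.1 and 3.2, the mean and variance of the $\Delta$-tilted information can be checked by numerical integration, and the first two terms of (3.51) can then be evaluated. This is a rough sketch under stated assumptions: natural logarithms are used (so $\log e = 1$), the well-known Gaussian rate-distortion function $\frac{1}{2}\log(\sigma^2/\Delta)$ is taken as the reference value, the parameters are arbitrary, and the $O(\log n)$ term of (3.51) is simply dropped.

```python
import math
from statistics import NormalDist

def tilted_info_gaussian(x, sigma2, delta):
    """Delta-tilted information of x in nats for N(0, sigma2) with squared error."""
    if delta >= sigma2:
        return 0.0
    return 0.5 * math.log(sigma2 / delta) + 0.5 * (x * x / sigma2 - 1.0)

def gaussian_mean(g, sigma2, half_width=12.0, steps=20000):
    """Trapezoidal approximation of E[g(X)] for X ~ N(0, sigma2)."""
    s = math.sqrt(sigma2)
    lo, hi = -half_width * s, half_width * s
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-x * x / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
        total += weight * g(x) * density
    return total * h

sigma2, delta = 4.0, 1.0                      # assumed source variance and distortion
mean_j = gaussian_mean(lambda x: tilted_info_gaussian(x, sigma2, delta), sigma2)
var_j = gaussian_mean(lambda x: (tilted_info_gaussian(x, sigma2, delta) - mean_j) ** 2, sigma2)
rate = 0.5 * math.log(sigma2 / delta)         # Gaussian rate-distortion function in nats

def two_term_approximation(n, eps):
    """n R - sqrt(n V) Phi^{-1}(eps), i.e. (3.51) without its O(log n) term."""
    return n * rate - math.sqrt(n * var_j) * NormalDist().inv_cdf(eps)
```

With these values `mean_j` matches `rate` and `var_j` is close to 1/2, independent of `sigma2` and `delta` as long as `delta < sigma2`. For `eps < 1/2` the two-term approximation exceeds `n * rate`, reflecting the backoff from the rate-distortion function at finite blocklength.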

Proof sketch. For an i.i.d. source $X^n$, the $\Delta$-tilted information single-letterizes because the optimum test
channel in the rate-distortion formula also has the required product structure. Hence,

$$j(X^n; P^n, \Delta) = \sum_{i=1}^{n} j(X_i; P, \Delta). \tag{3.52}$$

Using the Berry-Esseen theorem, the probability in (3.47) can be lower bounded as

$$\Pr\big( j(X^n; P^n, \Delta) \ge \log M + \gamma \big) \ge \Phi\!\left( \frac{nR(P,\Delta) - \log M - \gamma}{\sqrt{nV(P,\Delta)}} \right) - \frac{\kappa}{\sqrt{n}}, \tag{3.53}$$


where $\kappa$ is a function of the third absolute moment of $j(X;P,\Delta)$, which is finite by the assumption that
$\mathbb{E}_{P \times PW^*}[d(X,\hat{X}^*)^9] < \infty$. Now set $\gamma = \frac{1}{2}\log n$ and $M$ to the smallest integer larger than

$$\exp\!\left( nR(P,\Delta) - \sqrt{nV(P,\Delta)}\,\Phi^{-1}\!\Big( \varepsilon' + \frac{\kappa + 1}{\sqrt{n}} \Big) - \gamma \right). \tag{3.54}$$

By the non-asymptotic converse bound in Proposition 3.4, we find that $\varepsilon \ge \varepsilon'$. This implies that the number
of codewords must not be smaller than that stated in (3.54), concluding the converse proof.

For the direct part, we need a technical lemma [97, Lem. 2] relating the $(PW^*)^n$-probability of a $\Delta$-distortion
ball to the $\Delta$-tilted information.

Lemma 3.1. There exist constants $c, b, k > 0$ such that for all sufficiently large $n$,

$$\Pr\!\left( \log\frac{1}{(PW^*)^n(B_\Delta(X^n))} > \sum_{i=1}^{n} j(X_i; P, \Delta) + b\log n + c \right) \le \frac{k}{\sqrt{n}}. \tag{3.55}$$


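This lemma says that we can control the $(PW^*)^n$-probability of $\Delta$-distortion balls centered at a random
source sequence $X^n$ in terms of the $\Delta$-tilted information. Now define the random variable

$$G_n := \log M - \sum_{i=1}^{n} j(X_i; P, \Delta) - b\log n - c. \tag{3.56}$$

Choose the distribution $P_{\hat{X}}$ in the non-asymptotic achievability bound in Proposition 3.3 to be the product
distribution $(PW^*)^n$. Applying Lemma 3.1, we find that

$$\begin{aligned}
\varepsilon &\le \mathbb{E}\Big[ \exp\big( -M\,(PW^*)^n(B_\Delta(X^n)) \big) \Big] && (3.57)\\
&\le \mathbb{E}\big[ \exp(-\exp(G_n)) \big] + \frac{k}{\sqrt{n}} && (3.58)\\
&\le \Pr\Big( G_n < \log\frac{\ln n}{2} \Big) + \frac{1}{\sqrt{n}} \Pr\Big( G_n \ge \log\frac{\ln n}{2} \Big) + \frac{k}{\sqrt{n}}, && (3.59)
\end{aligned}$$

where in the final step, we split the expectation into two parts depending on whether $G_n \ge \log\frac{\ln n}{2}$ or
otherwise. Since $G_n$ is a sum of i.i.d. random variables, the first probability can be evaluated using the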
Berry-Esseen theorem similarly to (3.53), and the second bounded above by 1. □


3.6 Second-Order Asymptotics of Lossy Source Coding via the Method of Types


In this final section of the chapter, we briefly comment on how Theorem 3.3 can be obtained by means of
a technique that is based on the method of types. Of course, this technique only applies to discrete (finite
alphabet) sources so it is more restrictive than the Kostina-Verdú [97] method we presented. However, as
with all results proved using the method of types, the analysis technique and the form of the result may be
more insightful to some readers. The exposition in this section is due to Ingber and Kochman [86].

We make the simplifying assumption that the rate-distortion function $R(P, \Delta)$ is differentiable with
respect to $\Delta$ (guaranteed by the assumption (iv) in Section 3.4) and twice differentiable with respect to the
probability mass function $P$. Ingber and Kochman [86] considered the fundamental quantity

$$R'(x; P, \Delta) := \left.\frac{\partial R(Q, \Delta)}{\partial Q(x)}\right|_{Q=P}. \tag{3.60}$$


It can be shown [96, Thm. 2.2] that $R'(x; P, \Delta)$ and the $\Delta$-tilted information are related as follows:

$$R'(x; P, \Delta) = j(x; P, \Delta) - \log e. \tag{3.61}$$

Hence the expectation of $R'(X; P, \Delta)$ is the rate-distortion function $R(P, \Delta)$ up to a constant and its variance
is exactly the rate-dispersion function $V(P,\Delta)$ in (3.49).

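A codeword $\hat{\mathbf{x}}(m) \in \hat{\mathcal{X}}^n$ is simply an output of the decoder $\varphi(m)$. The collection of all $M$ codewords
forms the codebook. Given a codebook $\mathcal{C} = \{\hat{\mathbf{x}}(1), \ldots, \hat{\mathbf{x}}(M)\}$, we say that $\mathbf{x} \in \mathcal{X}^n$ is $\Delta$-covered by $\mathcal{C}$ if there
exists a codeword $\hat{\mathbf{x}}(m) \in \mathcal{C}$ such that $d^{(n)}(\mathbf{x}, \hat{\mathbf{x}}(m)) \le \Delta$.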

The analysis technique in [86] is based on the following lemma. 

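Lemma 3.2 (Type Covering). For every type $Q \in \mathscr{P}_n(\mathcal{X})$, there exists a codebook $\mathcal{C} := \{\hat{\mathbf{x}}(1), \ldots, \hat{\mathbf{x}}(M)\} \subset \hat{\mathcal{X}}^n$
of size $M$ and a function $g_1(|\mathcal{X}|, |\hat{\mathcal{X}}|)$ such that every $\mathbf{x} \in \mathcal{T}_Q$ is $\Delta$-covered by $\mathcal{C}$, and

$$\frac{1}{n}\log M \le R(Q,\Delta) + g_1(|\mathcal{X}|, |\hat{\mathcal{X}}|)\,\frac{\log n}{n}. \tag{3.62}$$

Furthermore, let the code size $M$ and a type $Q \in \mathscr{P}_n(\mathcal{X})$ be such that $\log M < nR(Q,\Delta)$. Then for every
codebook $\mathcal{C} \subset \hat{\mathcal{X}}^n$ of size $M$ the fraction of $\mathcal{T}_Q$ that is $\Delta$-covered by $\mathcal{C}$ is at most

$$\exp\big( -nR(Q,\Delta) + \log M - g_2(|\mathcal{X}|, |\hat{\mathcal{X}}|)\log n \big) \tag{3.63}$$

for some function $g_2(|\mathcal{X}|, |\hat{\mathcal{X}}|)$.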


The achievability part of the lemma in (3.62) is a refined version of the type covering lemma by Berger [16,
Sec. 6.2.1, Lem. 1]. A slightly weaker version of the lemma is also presented in Csiszár-Körner [531 Ch. 9] and
was used by Marton [106] to find the error exponent for lossy source coding. The refinement comes about
in the $O\big(\frac{\log n}{n}\big)$ remainder term in (3.62), which is required for analyzing the setting in which the excess distortion
probability is non-vanishing. The converse part in (3.63) is a corollary of Zhang-Yang-Wei [191, Lem. 3].

We now provide an alternative proof of Theorem 3.3 using the type covering lemma. The crux of the
achievability argument is to use the type covering lemma to identify a codebook of size $M$ such that
the sequences in $\mathcal{X}^n$ that it manages to $\Delta$-cover have probability approximately $1 - \varepsilon$, so the excess distortion
probability is roughly $\varepsilon$. The union of the type classes of the covered sequences is denoted as $\mathcal{K}$ in the proof below. The $P^n$-
probability of $\mathcal{K}$ can be estimated using the central limit relation, similarly to the analysis in the proof of
Theorem 3.2. The converse argument hinges on the fact that the codebook given by the achievability part of
the type covering lemma is essentially optimal in terms of its size.

Proof sketch of Theorem 3.3. Roughly speaking, the idea in the achievability proof is to “encode” all
sequences in $\mathcal{X}^n$ whose empirical rate-distortion function is no larger than some threshold. More specifically,
encode (use codes prescribed by Lemma 3.2) sequences belonging to

$$\mathcal{K} := \bigcup_{Q \in \mathscr{P}_n(\mathcal{X}) \,:\, R(Q,\Delta) \le \gamma} \mathcal{T}_Q, \tag{3.64}$$

where

$$\gamma := R(P,\Delta) - \sqrt{\frac{V(P,\Delta)}{n}}\,\Phi^{-1}(\varepsilon). \tag{3.65}$$


By (3.62) and the type counting lemma, the total number of codewords used to cover $\mathcal{K}$ satisfies the requirement in Theorem 3.3. The resultant
probability of excess distortion is $\Pr\big( R(P_{X^n}, \Delta) > \gamma \big)$ where $P_{X^n} \in \mathscr{P}_n(\mathcal{X})$ is the (random) type of $X^n \in \mathcal{X}^n$.
Similarly to (3.33) for the lossless case, the following central limit relation holds:

$$\sqrt{n}\,\big( R(P_{X^n},\Delta) - R(P,\Delta) \big) \xrightarrow{d} \mathcal{N}\big(0, V(P,\Delta)\big). \tag{3.66}$$

The above convergence can be verified by using the Berry-Esseen theorem for functions of i.i.d. random
vectors (Theorem 1.5) per the proof of Theorem 3.2. Hence, the probability of excess distortion is roughly $\varepsilon$ and
the achievability proof is complete.

The converse part follows from the fact that we can lower bound the probability of the excess
distortion event $\mathcal{E}_\Delta := \{ d^{(n)}(X^n, \hat{X}^n) > \Delta \}$ as

$$\Pr(\mathcal{E}_\Delta) \ge \Pr\big( \mathcal{E}_\Delta \,\big|\, R(P_{X^n}, \Delta) > R + \psi_n \big)\, \Pr\big( R(P_{X^n}, \Delta) > R + \psi_n \big), \tag{3.67}$$

where $R = \frac{1}{n}\log M$ is the code rate and $\psi_n$ is arbitrary. Now, by (3.63), if the realized type of the source is
$Q \in \mathscr{P}_n(\mathcal{X})$ where $R(Q, \Delta) > R + \psi_n$, then the fraction of the type class $\mathcal{T}_Q$ that is $\Delta$-covered is at most

$$\exp\big( -nR(Q, \Delta) + nR - g_2 \log n \big) \le \exp\big( -n\psi_n - g_2 \log n \big). \tag{3.68}$$

Since all sequences in a type class are equally likely (Lemma 1.3), the probability of no excess distortion
conditioned on the event $\{ R(P_{X^n}, \Delta) > R + \psi_n \}$ is at most $\frac{1}{n}$ if $\psi_n := (-g_2 + 1)\frac{\log n}{n}$. Thus

$$\Pr(\mathcal{E}_\Delta) \ge \Big( 1 - \frac{1}{n} \Big) \Pr\big( R(P_{X^n}, \Delta) > R + \psi_n \big). \tag{3.69}$$


For $\log M = nR$ chosen to be as in (3.51) in Theorem 3.3, the probability on the right is at least $\varepsilon - O\big(\frac{\log n}{\sqrt{n}}\big)$
by a quantitative version of the convergence in distribution in (3.66). □

Chapter 4

Channel Coding 


This chapter presents fixed error asymptotic results for point-to-point channel coding, which is perhaps 
the most fundamental problem in information theory. Shannon [141] showed that the maximum rate of
transmission over a memoryless channel is the information capacity

$$C(W) = \max_{P \in \mathscr{P}(\mathcal{X})} I(P, W). \tag{4.1}$$

This first-order fundamental limit is attained as the number of channel uses (or blocklength) tends to infinity.
Wolfowitz [180] showed the strong converse for a large class of memoryless channels, which intuitively means
that for codes with rates above $C(W)$, the error probability necessarily tends to one. The contrapositive of
this statement is that, even if we allow the error probability to be close to one (a strange requirement in
practice), one cannot send more bits per channel use than what is prescribed by the information capacity in
(4.1).

In the rest of this chapter, we revisit the problem of channel coding from the viewpoint of the error 
probability being non-vanishing. First, we define the channel coding problem as well as some important non- 
asymptotic fundamental limits. Next we derive bounds on these limits. Some of these bounds are intimately 
linked to ideas in and quantities related to binary hypothesis testing. We then evaluate these bounds for 
large blocklengths while keeping the error probability (either maximum or average) bounded above by some 
constant $\varepsilon \in (0,1)$. We only concern ourselves with two classes of channels, namely the discrete memoryless
channel (DMC) and the additive white Gaussian noise (AWGN) channel. We present second- and even 
third-order asymptotic expansions for the logarithm of the non-asymptotic fundamental limits. The chapter 
is concluded with a discussion of source-channel transmission and the cost of separation. 

The material in this chapter on point-to-point channel coding is based primarily on the works by 
Strassen [152], Hayashi [76], Polyanskiy-Poor-Verdú [123], Altuğ-Wagner [12], Tomamichel-Tan [164] and
Tan-Tomamichel [159]. The material on joint source-channel coding is based on the works by Kostina-
Verdú |M| and Wang-Ingber-Kochman [170].

4.1 Definitions and Non-Asymptotic Bounds 

We now set up the channel coding problem formally. A channel is simply a stochastic map W from an input 
alphabet $\mathcal{X}$ to an output alphabet $\mathcal{Y}$. For the majority of the chapter, we assume that there are no cost
constraints on the codewords; the necessary changes required for channels with cost constraints (such as
the AWGN channel) will be pointed out. See Fig. 4.1 for an illustration of the setup.

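An $(M, \varepsilon)_{\mathrm{ave}}$-code for the channel $W \in \mathscr{P}(\mathcal{Y}|\mathcal{X})$ consists of a message set $\mathcal{M} = \{1, \ldots, M\}$ and a pair of
maps including an encoder $f : \{1, \ldots, M\} \to \mathcal{X}$ and a decoder $\varphi : \mathcal{Y} \to \{1, \ldots, M\}$ such that the average
error probability satisfies

$$\frac{1}{M} \sum_{m \in \mathcal{M}} W\big( \mathcal{Y} \setminus \varphi^{-1}(m) \,\big|\, f(m) \big) \le \varepsilon. \tag{4.2}$$

Figure 4.1: Illustration of the channel coding problem.

An $(M, \varepsilon)_{\max}$-code is the same as an $(M, \varepsilon)_{\mathrm{ave}}$-code except that instead of the condition in (4.2), the maximum
error probability satisfies

$$\max_{m \in \mathcal{M}} W\big( \mathcal{Y} \setminus \varphi^{-1}(m) \,\big|\, f(m) \big) \le \varepsilon. \tag{4.3}$$

The number $M$ is called the size of the code.

We also define the following non-asymptotic fundamental limits

$$M^*_{\mathrm{ave}}(W, \varepsilon) := \max\{ M \in \mathbb{N} : \exists \text{ an } (M, \varepsilon)_{\mathrm{ave}}\text{-code for } W \}, \text{ and} \tag{4.4}$$

$$M^*_{\max}(W, \varepsilon) := \max\{ M \in \mathbb{N} : \exists \text{ an } (M, \varepsilon)_{\max}\text{-code for } W \}. \tag{4.5}$$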

In the following, we will evaluate these limits when $W$ assumes some structure, for example memorylessness
and stationarity. Note that blocklength plays no role in the above definitions. In the sequel, we study
the dependence of the fundamental limits on the blocklength $n$ by inserting a “super-channel” $W^n$ indexed
by $n$ in place of $W$ in (4.4) and (4.5). Before we perform the evaluations, we state some bounds on $M$ and
$\varepsilon$ for arbitrary channels $W$.


4.1.1 Achievability Bounds 

In this section, we state three achievability bounds. We evaluate these bounds for memoryless channels
in the following sections. The first is Feinstein's bound [53] stated in terms of the $\varepsilon$-information spectrum
divergence.

Proposition 4.1 (Feinstein's theorem). Let $\varepsilon \in (0, 1)$ and let $W$ be any channel from $\mathcal{X}$ to $\mathcal{Y}$. Then for
any $\eta \in (0, \varepsilon)$, we have

$$\log M^*_{\max}(W, \varepsilon) \ge \sup_{P \in \mathscr{P}(\mathcal{X})} D_s^{\varepsilon - \eta}(P \times W \| P \times PW) - \log\frac{1}{\eta}. \tag{4.6}$$

The proof of this bound can be found in Han's book [67, Lem. 3.4.1] and uses a greedy approach
for selecting codewords. The average error probability version of this bound can be proved in a more
straightforward manner using threshold decoding; cf. [55, Thm. 1]. The following is a slight strengthening
of Feinstein's theorem.

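Proposition 4.2. There exists an $(M, \varepsilon)_{\max}$-code for $W$ such that for any $\gamma > 0$ and any input distribution
$P \in \mathscr{P}(\mathcal{X})$,

$$\varepsilon \le \Pr\!\left( \log\frac{W(Y|X)}{PW(Y)} \le \gamma \right) + M \sup_{x \in \mathcal{X}} \Pr\!\left( \log\frac{W(Y|x)}{PW(Y)} > \gamma \right), \tag{4.7}$$

where the distribution of $(X,Y)$ is $P \times W$ in the first probability and the distribution of $Y$ is $PW$ in the
second.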

The proof of this bound can be found in [123, Thm. 21]. It uses a sequential random coding technique
where each codeword is chosen at random based on previous choices. Feinstein's bound can be derived as
a corollary to Proposition 4.2 by upper bounding the final probability in (4.7) by $\exp(-\gamma)$ and using the
identification $\gamma = \log\frac{M}{\eta}$.

The previous two bounds are essentially threshold decoding bounds, i.e., we compare the likelihood ratio
between the channel and the output distribution to a threshold $\gamma$. For the average probability of error
setting, one can compare the likelihood ratios of codewords directly and use maximum likelihood decoding
to obtain the following bound.

Proposition 4.3 (Random Coding Union (RCU) Bound). There exists an $(M, \varepsilon)_{\mathrm{ave}}$-code for $W$ such that
for any input distribution $P \in \mathscr{P}(\mathcal{X})$,

$$\varepsilon \le \mathbb{E}\!\left[ \min\left\{ 1,\; M \Pr\!\left( \log\frac{W(Y|\bar{X})}{PW(Y)} \ge \log\frac{W(Y|X)}{PW(Y)} \,\middle|\, X, Y \right) \right\} \right], \tag{4.8}$$

where $(X, \bar{X}, Y)$ is distributed as $P(x)P(\bar{x})W(y|x)$.

The proof of this bound can be found in [123, Thm. 16]. Note that the outer expectation is over $(X,Y)$
while the inner probability is over $\bar{X}$. Under certain conditions on a DMC and any AWGN channel, one can
use the RCU bound to prove the achievability of $\frac{1}{2}\log n + O(1)$ for the third-order term in the asymptotic
expansion of $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$. This is what we do in the subsequent sections.


4.1.2 A Converse Bound 

The only converse bound we will evaluate asymptotically appeared in different forms in the works by Verdú-
Han [169, Lem. 4], Hayashi-Nagaoka [77, Lem. 4], Polyanskiy-Poor-Verdú [123, Sec. III-E] and Tomamichel-
Tan [164, Prop. 6]. This converse bound relates channel coding to binary hypothesis testing. This relation,
and its application to asymptotic converse theorems, can be traced back to early works by Shannon-Gallager-
Berlekamp [146] and Wolfowitz [180]. The reader is referred to Dalai's work [42, App. B] for an excellent
modern exposition on this topic.

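Proposition 4.4 (Symbol-Wise Converse Bound). Let $\varepsilon \in (0,1)$ and let $W$ be any channel from $\mathcal{X}$ to $\mathcal{Y}$.
Then, for any $\eta \in (0, 1 - \varepsilon)$, we have

$$\log M^*_{\mathrm{ave}}(W, \varepsilon) \le \inf_{Q \in \mathscr{P}(\mathcal{Y})} \sup_{x \in \mathcal{X}} D_s^{\varepsilon + \eta}\big( W(\cdot|x) \,\|\, Q \big) + \log\frac{1}{\eta}. \tag{4.9}$$

If the codewords are constrained to belong to some set $\mathcal{A} \subset \mathcal{X}$ (due to cost constraints, say), the supremum
above is to be replaced by $\sup_{x \in \mathcal{A}}$.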

The first part of the proof is analogous to the meta-converse in [123, Thm. 27]. See also Wang-Colbeck-
Renner m and Wang-Renner [173], which inspired the conceptually simpler proof technique presented
below. The bound in (4.9) is a “symbol-wise” relaxation of the meta-converse [123, Thms. 28 and 31] and
Hayashi-Nagaoka's converse [77, Lem. 4]. The maximization over symbols allows us to apply our converse
bound on non-constant-composition codes for DMCs directly. With an appropriate choice of $Q$, it allows
us to prove a $\frac{1}{2}\log n + O(1)$ upper bound for the third-order asymptotics for positive $\varepsilon$-dispersion DMCs
(cf. Theorem 4.3).


We remark that, in our notation, the information spectrum converse bound in Verdú-Han [169, Lem. 4]
takes the form

$$\log M^*_{\mathrm{ave}}(W, \varepsilon) \le \sup_{P \in \mathscr{P}(\mathcal{X})} D_s^{\varepsilon + \eta}(P \times W \| P \times PW) + \log\frac{1}{\eta} \tag{4.10}$$

so it does not allow one to choose the output distribution $Q$. Observe the beautiful duality of the Verdú-
Han converse with Feinstein's direct theorem. The bound in Hayashi-Nagaoka [77, Lem. 4] (stated for
classical-quantum channels in their context) affords this freedom and is stated as

$$\log M^*_{\mathrm{ave}}(W, \varepsilon) \le \inf_{Q \in \mathscr{P}(\mathcal{Y})} \sup_{P \in \mathscr{P}(\mathcal{X})} D_s^{\varepsilon + \eta}(P \times W \| P \times Q) + \log\frac{1}{\eta}. \tag{4.11}$$

Hence, we see that the bound in Proposition 4.4 is essentially a “symbol-wise” relaxation of the Hayashi-
Nagaoka converse bound [77, Lem. 4] (applying Lemma 2.3) as well as the meta-converse theorems in [123,
Thms. 28 and 31].


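Since the proof of Proposition 4.4 is short, we provide the details.

Proof of Proposition 4.4. Fix an $(|\mathcal{M}|, \varepsilon)_{\mathrm{ave}}$-code for $W$ with message set $\mathcal{M}$ and an arbitrary output
distribution $Q \in \mathscr{P}(\mathcal{Y})$. Let $M$ and $\hat{M}$ be the sent message and estimated message respectively. Starting from a
uniform distribution $P_M$ over $\mathcal{M}$, the Markov chain $M \to X \to Y \to \hat{M}$ induces the joint distribution
$P_{MXY\hat{M}}$. By the data-processing inequality for $D_h^\varepsilon$ (Lemma 2.1),

$$D_h^\varepsilon(P \times W \| P \times Q) = D_h^\varepsilon(P_{XY} \| P_X \times Q_Y) \ge D_h^\varepsilon(P_{M\hat{M}} \| P_M \times Q_{\hat{M}}) \tag{4.12}$$

where $P_X = P$ and $Q_{\hat{M}}$ is the distribution induced by $\varphi$ applied to $Q_Y = Q$. Moreover, using the test
$\delta(m, \hat{m}) := 1\{m \ne \hat{m}\}$, we see that

$$\mathbb{E}_{P_{M\hat{M}}}\big[ \delta(M, \hat{M}) \big] = \Pr(M \ne \hat{M}) \le \varepsilon \tag{4.13}$$

where $(M, \hat{M}) \sim P_{M\hat{M}}$ above, and

$$\begin{aligned}
\mathbb{E}_{P_M \times Q_{\hat{M}}}\big[ \delta(M, \hat{M}) \big] &= \sum_{(m, \hat{m}) \in \mathcal{M} \times \mathcal{M}} P_M(m)\, Q_{\hat{M}}(\hat{m})\, 1\{m \ne \hat{m}\} && (4.14)\\
&= 1 - \sum_{\hat{m} \in \mathcal{M}} Q_{\hat{M}}(\hat{m}) \sum_{m \in \mathcal{M}} P_M(m)\, 1\{m = \hat{m}\} && (4.15)\\
&= 1 - \sum_{\hat{m} \in \mathcal{M}} Q_{\hat{M}}(\hat{m})\, \frac{1}{|\mathcal{M}|} && (4.16)\\
&= 1 - \frac{1}{|\mathcal{M}|}. && (4.17)
\end{aligned}$$

Hence, $D_h^\varepsilon(P_{M\hat{M}} \| P_M \times Q_{\hat{M}}) \ge \log|\mathcal{M}| + \log(1 - \varepsilon)$ per the definition of the $\varepsilon$-hypothesis testing divergence.
Finally, applying Lemmas 2.2 and 2.3 yields

$$\begin{aligned}
\sup_{x \in \mathcal{X}} D_s^{\varepsilon + \eta}\big( W(\cdot|x) \,\|\, Q \big) &\ge D_s^{\varepsilon + \eta}(P \times W \| P \times Q) && (4.18)\\
&\ge D_h^{\varepsilon}(P \times W \| P \times Q) - \log\frac{1 - \varepsilon}{\eta} && (4.19)\\
&\ge \log|\mathcal{M}| - \log\frac{1}{\eta}. && (4.20)
\end{aligned}$$

This yields the converse bound upon minimizing over $Q \in \mathscr{P}(\mathcal{Y})$.

□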
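
The computation in (4.14)-(4.17) states that, under the product distribution $P_M \times Q_{\hat{M}}$ with $P_M$ uniform, the test $\delta$ has expectation $1 - 1/|\mathcal{M}|$ regardless of $Q_{\hat{M}}$. The following small sketch checks this with exact rational arithmetic; the particular $Q_{\hat{M}}$ below is an arbitrary illustrative choice, not taken from the text.

```python
from fractions import Fraction

def mismatch_probability(q_hat):
    """E[1{M != Mhat}] when M is uniform on {0,...,|M|-1} and Mhat ~ q_hat independently."""
    size = len(q_hat)
    uniform = Fraction(1, size)
    return sum(uniform * q_hat[m_hat]
               for m in range(size)
               for m_hat in range(size)
               if m != m_hat)

# Hypothetical output-induced distribution on four messages.
q_hat = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6), Fraction(0)]
value = mismatch_probability(q_hat)
```

Changing `q_hat` to any other probability vector of the same length leaves `value` unchanged, which is what makes the bound independent of the decoder.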


4.2 Asymptotic Expansions for Discrete Memoryless Channels 


In this section, we consider asymptotic expansions for DMCs. Recall that a DMC (without feedback) for
blocklength $n$ is a channel $W^n \in \mathscr{P}(\mathcal{Y}^n|\mathcal{X}^n)$ where the input and output alphabets are finite and the channel
law satisfies

$$W^n(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} W(y_i|x_i), \qquad \forall\, (\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n. \tag{4.21}$$

Thus, the channel behaves in a stationary and memoryless manner. Shannon [141] found the maximum
rate of reliable communication over a DMC and termed this rate the capacity $C(W)$ given in (4.1). In this
section, we derive refinements of this fundamental limit of communication by characterizing the first three
terms in the asymptotic expansions of $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ and $\log M^*_{\max}(W^n, \varepsilon)$. Before we do so, we recall

some fundamental quantities and define a few new ones.

4.2.1 Definitions for Discrete Memoryless Channels


Recall that the conditional relative entropy for a fixed input and output distribution pair $(P,Q) \in \mathscr{P}(\mathcal{X}) \times \mathscr{P}(\mathcal{Y})$
is $D(W\|Q|P) := \sum_x P(x) D(W(\cdot|x)\|Q)$. The mutual information is $I(P,W) := D(P \times W \| P \times PW) =
D(W\|PW|P)$. Moreover, $C(W)$ is the information capacity defined in (4.1) and

$$\Pi(W) := \{ P \in \mathscr{P}(\mathcal{X}) : I(P, W) = C(W) \} \tag{4.22}$$

is the set of capacity-achieving input distributions (CAIDs).¹ The set of CAIDs is convex and
compact in $\mathscr{P}(\mathcal{X})$. The unique [56, Cor. 2 to Thm. 4.5.2] capacity-achieving output distribution (CAOD) is
denoted as $Q^*$ and $Q^* = PW$ for all $P \in \Pi$. Furthermore, it satisfies $Q^*(y) > 0$ for all $y \in \mathcal{Y}$ [56, Cor. 1 to
Thm. 4.5.2], where we assume that all outputs are accessible.


Channel Dispersions 

Recall from (2.29) that the variance of the log-likelihood ratio $\log\frac{P(X)}{Q(X)}$ under $P$ is known as the divergence
variance, i.e.,

$$V(P\|Q) := \sum_{x \in \mathcal{X}} P(x) \left[ \log\frac{P(x)}{Q(x)} - D(P\|Q) \right]^2. \tag{4.23}$$

We also define the conditional divergence variance $V(W\|Q|P) := \sum_x P(x)\, V(W(\cdot|x)\|Q)$ and the conditional
information variance $V(P,W) := V(W\|PW|P)$. Define the unconditional information variance $U(P,W) :=
V(P \times W \| P \times PW)$. Note that

$$V(P, W) = U(P, W) \tag{4.24}$$

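for all $P \in \Pi$ [123, Lem. 62]. This is easy to verify because from [56, Thm. 4.5.1], we know that all $P \in \Pi$
(i.e., CAIDs) satisfy

$$\begin{aligned}
\forall\, x : P(x) > 0 \qquad & D(W(\cdot|x)\|PW) = C && (4.25)\\
\forall\, x : P(x) = 0 \qquad & D(W(\cdot|x)\|PW) \le C. && (4.26)
\end{aligned}$$

The $\varepsilon$-channel dispersion [123, Def. 2] for $\varepsilon \in (0,1) \setminus \{\frac{1}{2}\}$ is the following operational quantity:

$$V_\varepsilon(W) := \limsup_{n \to \infty} \frac{1}{n} \left( \frac{\log M^*_{\mathrm{ave}}(W^n, \varepsilon) - nC(W)}{\Phi^{-1}(\varepsilon)} \right)^2. \tag{4.27}$$

This operational quantity was shown [123, Eq. (223)] to be equal to²

$$V_\varepsilon(W) = \begin{cases} V_{\min}(W) & \text{if } \varepsilon < \frac{1}{2} \\ V_{\max}(W) & \text{if } \varepsilon \ge \frac{1}{2} \end{cases}, \tag{4.28}$$

where $V_{\min}(W) := \min_{P \in \Pi} V(P, W)$ and $V_{\max}(W) := \max_{P \in \Pi} V(P, W)$.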
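
The identity (4.24) and the conditions (4.25)-(4.26) can be checked on a concrete channel. The sketch below uses a binary symmetric channel with crossover probability 0.11 (an assumption chosen for illustration, as are the function names) and compares the conditional and unconditional information variances, first at the uniform input, which is the CAID of this channel, and then at a non-uniform input where the two quantities differ. All quantities are in nats.

```python
import math

def divergence_and_variance(w_row, q):
    """Return D(w||q) and V(w||q) for a row w of the channel and an output law q."""
    terms = [(p, math.log(p / qy)) for p, qy in zip(w_row, q) if p > 0]
    d = sum(p * llr for p, llr in terms)
    v = sum(p * (llr - d) ** 2 for p, llr in terms)
    return d, v

def information_quantities(P, W):
    """W[x][y] = W(y|x). Returns (I(P,W), V(P,W), U(P,W))."""
    nx, ny = len(W), len(W[0])
    q = [sum(P[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    per_input = [divergence_and_variance(W[x], q) for x in range(nx)]
    mutual_info = sum(P[x] * per_input[x][0] for x in range(nx))
    conditional_var = sum(P[x] * per_input[x][1] for x in range(nx))
    second_moment = sum(P[x] * W[x][y] * math.log(W[x][y] / q[y]) ** 2
                        for x in range(nx) for y in range(ny)
                        if P[x] * W[x][y] > 0)
    unconditional_var = second_moment - mutual_info ** 2
    return mutual_info, conditional_var, unconditional_var

crossover = 0.11
bsc = [[1 - crossover, crossover], [crossover, 1 - crossover]]
cap, v_caid, u_caid = information_quantities([0.5, 0.5], bsc)
_, v_other, u_other = information_quantities([0.8, 0.2], bsc)
```

The gap `u_other - v_other` equals the variance of $D(W(\cdot|x)\|PW)$ over the input, which vanishes exactly when (4.25) holds on the support of $P$.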


Singularity 

The asymptotic expansions stated in Theorems 4.1 and 4.3 depend on the singularity of the channel. We
say a DMC $W \in \mathscr{P}(\mathcal{Y}|\mathcal{X})$ is singular if for all $(x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{X}$ with $W(y|x)W(y|z) > 0$, one has
$W(y|x) = W(y|z)$. A DMC that is not singular is called non-singular.

¹We often drop the dependence on $W$ if it is clear from context.

²Notice that for $\varepsilon = \frac{1}{2}$, we set $V_\varepsilon = V_{\max}$. This is somewhat unconventional; cf. [123, Thm. 48]. However, doing so
ensures that subsequent theorems can be stated compactly. Nonetheless, from the viewpoint of the normal approximation, it is
immaterial how we choose $V_{1/2}$ since $\Phi^{-1}(\frac{1}{2}) = 0$ (cf. [123, after Eq. (280)]).

Note that if the DMC is singular, then

$$\log\frac{W(y|x)}{W(y|x')} \in \{ -\infty, 0, +\infty \} \tag{4.29}$$

for all $(x, x', y) \in \mathcal{X} \times \mathcal{X} \times \mathcal{Y}$. Intuitively, if a DMC is singular, checking feasibility is, in fact, optimum
decoding. That is, given a codebook $\mathcal{C} := \{\mathbf{x}(1), \ldots, \mathbf{x}(M)\}$, we decide that $m \in \{1, \ldots, M\}$ is sent if, given
the channel output $\mathbf{y}$, it uniquely satisfies

$$W^n(\mathbf{y}|\mathbf{x}(m)) = \prod_{i=1}^{n} W(y_i|x_i(m)) > 0. \tag{4.30}$$

It is known m that if W is singular, the capacity of W equals its zero-undetected error capacity. 

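Example 4.1. Consider the binary erasure channel $W$ with input alphabet $\mathcal{X} = \{0, 1\}$ and output alphabet
$\mathcal{Y} = \{0, \mathrm{e}, 1\}$ where $\mathrm{e}$ is the erasure symbol. The channel transition probabilities of $W$ are given by

$$W(y|0) = \begin{cases} 1 - \delta_0 & y = 0 \\ \delta_0 & y = \mathrm{e} \\ 0 & y = 1 \end{cases} \qquad \text{and} \qquad W(y|1) = \begin{cases} 0 & y = 0 \\ \delta_1 & y = \mathrm{e} \\ 1 - \delta_1 & y = 1 \end{cases}. \tag{4.31}$$

If $\delta_0 = \delta_1 = \delta > 0$, then $W(\mathrm{e}|0)W(\mathrm{e}|1) > 0$ and $W(\mathrm{e}|0) = W(\mathrm{e}|1) = \delta$, and so the channel is singular. If
$\delta_0 \ne \delta_1$ (with both positive), the channel is non-singular.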
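
The definition of singularity amounts to a column-wise check on the channel matrix: within each output column, all strictly positive entries must coincide. A minimal sketch of such a check follows; the function names, the numerical tolerance, and the erasure probabilities are illustrative assumptions.

```python
def is_singular(W, tol=1e-12):
    """W[x][y] = W(y|x). True iff all positive entries in each output column are equal."""
    num_inputs, num_outputs = len(W), len(W[0])
    for y in range(num_outputs):
        positive = [W[x][y] for x in range(num_inputs) if W[x][y] > 0]
        if positive and max(positive) - min(positive) > tol:
            return False
    return True

def erasure_channel(delta0, delta1):
    """Rows are inputs 0 and 1; columns are outputs 0, e, 1, as in (4.31)."""
    return [[1 - delta0, delta0, 0.0],
            [0.0, delta1, 1 - delta1]]

symmetric_bec = erasure_channel(0.3, 0.3)
asymmetric_bec = erasure_channel(0.3, 0.1)
bsc = [[0.9, 0.1], [0.1, 0.9]]
```

Here `symmetric_bec` is singular, while `asymmetric_bec` and `bsc` are not, since each has a column with two distinct positive entries.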

Symmetry 

We say a DMC is symmetric [56, pp. 94] if the channel outputs can be partitioned into subsets such that
within each subset, the matrix of transition probabilities satisfies the following: every row (resp. column) is 
a permutation of every other row (resp. column). 


4.2.2 Achievability Bounds: Asymptotic Expansions 

In this section, we provide lower bounds to $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ and $\log M^*_{\max}(W^n, \varepsilon)$. We focus on the positive
$\varepsilon$-dispersion case. For other cases, the reader is referred to [123, Thm. 47].


Independent and Identically Distributed (i.i.d.) Codes 


The following bounds are achieved using i.i.d. random codes. 

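Theorem 4.1. If the DMC satisfies $V_\varepsilon(W) > 0$,

$$\log M^*_{\max}(W^n, \varepsilon) \ge nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + O(1). \tag{4.32}$$

If, in addition, the DMC is non-singular,

$$\log M^*_{\mathrm{ave}}(W^n, \varepsilon) \ge nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \tag{4.33}$$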

Theorem 4.1 says that asymptotically, $\log M^*_{\max}(W^n, \varepsilon)$ is lower bounded by the Gaussian approximation
$nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon)$ plus a constant term. In addition, under the non-singularity condition, one can say more,
namely that $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ is lower bounded by the Gaussian approximation plus $\frac{1}{2}\log n + O(1)$, known as
the third-order term. The proof of the former statement in (4.32) uses the strengthened version of Feinstein's
theorem in Proposition 4.2 while the proof of the latter statement in (4.33) requires the use of the RCU
bound in Proposition 4.3. For a comparison of the third-order terms achievable by various achievability
bounds, the reader is referred to Table 4.1.

We will only provide the proof of the former statement, as the proof of the latter is similar to the achievability
proof for AWGN channels for which we show key steps in Section 4.3. For the proof of the latter statement

(4.33), the reader is referred to [119, Sec. 3.4.5].

Bound                                         Third-Order Term
Feinstein + Const. Compo. (Thm. 4.2)          $-(|\mathcal{X}| - \frac{1}{2}) \log n + O(1)$
Feinstein + i.i.d. (Rmk. 4.1)                 $-\frac{1}{2} \log n + O(1)$
Strengthened Feinstein + i.i.d. (Thm. 4.1)    $O(1)$
RCU + i.i.d. (Thm. 4.1)                       $\frac{1}{2} \log n + O(1)$

Table 4.1: Comparison of the third-order terms achievable by using various achievability bounds (in Section
4.1.1) or requirements on the code (such as constant composition). The $\frac{1}{2}\log n + O(1)$ that is achieved by
evaluating the RCU bound holds only for the class of non-singular DMCs.


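Proof of (4.32). We specialize the strengthened version of Feinstein's result in Proposition 4.2. Choose $P_{X^n}$ to be the $n$-fold product of a CAID $P_X^*$ that achieves $V_\varepsilon$. The first probability in (4.7) can be bounded using the Berry-Esseen theorem as

$$\Pr\bigg(\log\frac{W^n(Y^n|X^n)}{(P_X^*W)^n(Y^n)} \le \gamma\bigg) = \Pr\bigg(\sum_{i=1}^n \log\frac{W(Y_i|X_i)}{P_X^*W(Y_i)} \le \gamma\bigg) \tag{4.34}$$

$$\le \Phi\bigg(\frac{\gamma - nC}{\sqrt{nV_\varepsilon}}\bigg) + \frac{6T}{\sqrt{nV_\varepsilon^3}}, \tag{4.35}$$

where $T$ is the third absolute moment of $\log W(Y|X) - \log P_X^*W(Y)$ and the variance is $U(P_X^*, W)$, which is equal to $V_\varepsilon$ by (4.24). To bound the second probability in (4.7), we define

$$V_x := V\big(W(\cdot|x)\,\|\,P_X^*W\big), \quad\text{and} \tag{4.36}$$

$$T_x := \mathbb{E}\bigg[\bigg|\log\frac{W(Y|x)}{P_X^*W(Y)} - D\big(W(\cdot|x)\,\|\,P_X^*W\big)\bigg|^3\bigg]. \tag{4.37}$$

Since the CAOD $P_X^*W$ is positive on $\mathcal{Y}$ [SH Cor. 1 to Thm. 4.5.2], $V_- := \min_{x\in\mathcal{X}} V_x > 0$. It can also be shown similarly to [123, Lem. 46] that $T_+ := \max_{x\in\mathcal{X}} T_x < \infty$. Now, for all $\mathbf{x} \in \mathcal{X}^n$, the second probability in (4.7), in which $Y^n$ is distributed according to $(P_X^*W)^n$, can be bounded as

$$\Pr\bigg(\log\frac{W^n(Y^n|\mathbf{x})}{(P_X^*W)^n(Y^n)} > \gamma\bigg) = \mathbb{E}_{(P_X^*W)^n}\bigg[\mathbb{1}\bigg\{\log\frac{W^n(Y^n|\mathbf{x})}{(P_X^*W)^n(Y^n)} > \gamma\bigg\}\bigg] \tag{4.38}$$

$$= \mathbb{E}_{W^n(\cdot|\mathbf{x})}\bigg[\exp\bigg(-\log\frac{W^n(Y^n|\mathbf{x})}{(P_X^*W)^n(Y^n)}\bigg)\,\mathbb{1}\bigg\{\log\frac{W^n(Y^n|\mathbf{x})}{(P_X^*W)^n(Y^n)} > \gamma\bigg\}\bigg] \tag{4.39}$$

$$\le 2\bigg(\frac{\log 2}{\sqrt{2\pi}} + \frac{12\,T_+}{V_-}\bigg)\frac{\exp(-\gamma)}{\sqrt{nV_-}}, \tag{4.40}$$

where the final inequality is an application of Theorem 1.3. Now choose

$$\gamma := nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon'), \quad\text{with} \tag{4.41}$$

$$\varepsilon' := \varepsilon - \frac{1}{\sqrt{n}}\bigg[\frac{6T}{\sqrt{V_\varepsilon^3}} + \frac{2}{\sqrt{V_-}}\bigg(\frac{\log 2}{\sqrt{2\pi}} + \frac{12\,T_+}{V_-}\bigg)\bigg]. \tag{4.42}$$

Also choose $M = \lfloor\exp(\gamma)\rfloor$. With these choices, (4.35) is at most $\varepsilon'$ plus the first bracketed term of (4.42) divided by $\sqrt{n}$, and $M$ times (4.40) is at most the second bracketed term divided by $\sqrt{n}$, so the error probability does not exceed $\varepsilon$. Since $\log M = \gamma + O(1)$ and $\varepsilon' = \varepsilon - O(n^{-1/2})$, a Taylor expansion of $\Phi^{-1}(\cdot)$ completes the proof of (4.32). □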
Remark 4.1. We remark that if we use Feinstein's theorem in Proposition 4.1 (instead of its strengthened version in Proposition 4.2), and the codebook is generated in an i.i.d. manner according to $(P_X^*)^n$, the third-order term would be $-\frac{1}{2}\log n + O(1)$. Indeed, let $\eta$ in Feinstein's theorem be $n^{-1/2}$. Then, the $(\varepsilon-\eta)$-information spectrum divergence can be expanded as

$$D_s^{\varepsilon-\eta}\big((P_X^*)^n \times W^n \,\big\|\, (P_X^*)^n \times (P_X^*W)^n\big) = nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + O(1). \tag{4.43}$$

This follows from the asymptotic expansion of $D_s^\varepsilon$ (see the corresponding corollary in Chapter 2) and the fact that $U(P_X^*, W) = V(P_X^*, W) = V_\varepsilon(W)$, similarly to (4.35). Coupled with the fact that $\log\eta = -\frac{1}{2}\log n$, we see that the third-order term is (at least) $-\frac{1}{2}\log n + O(1)$.


Constant Composition Codes and Cost Constraints 

In many applications, it may not be desirable to use i.i.d. codes as we did in the above proof. For example, for channels with additive costs, each codeword $\mathbf{x}(m)$, $m = 1, \ldots, M$, must satisfy

$$\frac{1}{n}\sum_{i=1}^n b(x_i(m)) \le \Gamma \tag{4.44}$$

for some cost function $b : \mathcal{X} \to [0,\infty)$ and some cost constraint $\Gamma > 0$. In this case, if the type $P \in \mathcal{P}_n(\mathcal{X})$ of each codeword $\mathbf{x}(m)$ is the same for all $m$ and it satisfies

$$\mathbb{E}_P[b(X)] \le \Gamma, \tag{4.45}$$

then the cost constraint in (4.44) is satisfied. This class of codes is called constant composition codes of type $P$. The Gaussian approximation can be achieved using constant composition codes. Constant composition coding was used by Hayashi for the DMC with additive cost constraints [71, Thm. 3]. He then used this result to prove the second-order asymptotics for the AWGN channel [71, Thm. 5] by discretizing the real line increasingly finely as the blocklength grows. It is more difficult to prove conclusive results on the third-order terms using a constant composition ensemble [98]; nonetheless, it is instructive to understand the technique to demonstrate the achievability of the Gaussian approximation. Let $M^*_{\max,\mathrm{cc}}(W^n, \varepsilon)$ denote the maximum number of codewords transmissible over $W^n$ with maximum error probability $\varepsilon$ using constant composition codes.


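Theorem 4.2. If the DMC satisfies $V_\varepsilon(W) > 0$,

$$\log M^*_{\max,\mathrm{cc}}(W^n, \varepsilon) \ge nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) - \Big(|\mathcal{X}| - \frac{1}{2}\Big)\log n + O(1). \tag{4.46}$$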


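Proof sketch of Theorem 4.2. We use Feinstein's theorem (Proposition 4.1). Choose a type $P \in \mathcal{P}_n(\mathcal{X})$ that is the closest in the variational distance sense to a CAID $P_X^*$ achieving $V_\varepsilon$. By [43, Lem. 2.1.2], we know that

$$\|P - P_X^*\|_1 \le \frac{|\mathcal{X}|}{n}. \tag{4.47}$$

Then consider the input distribution in Feinstein's theorem to be $P_{X^n}(\mathbf{x})$, the uniform distribution over $\mathcal{T}_P$, i.e.,

$$P_{X^n}(\mathbf{x}) = \frac{\mathbb{1}\{\mathbf{x} \in \mathcal{T}_P\}}{|\mathcal{T}_P|}. \tag{4.48}$$

Clearly such a code is constant composition. Now we claim that

$$P_{X^n}W^n(\mathbf{y}) \le |\mathcal{P}_n(\mathcal{X})|\,(PW)^n(\mathbf{y}) \tag{4.49}$$

for all $\mathbf{y} \in \mathcal{Y}^n$. To see this, note that for $\mathbf{x} \in \mathcal{T}_P$,

$$P_{X^n}(\mathbf{x}) = \frac{1}{|\mathcal{T}_P|} \le |\mathcal{P}_n(\mathcal{X})|\exp(-nH(P)) = |\mathcal{P}_n(\mathcal{X})|\,P^n(\mathbf{x}), \tag{4.50}$$

where the inequality follows from Lemma 1.2 and the final equality from Lemma 1.3. For $\mathbf{x} \notin \mathcal{T}_P$, (4.50) also holds as $P_{X^n}(\mathbf{x}) = 0$. Multiplying (4.50) by $W^n(\mathbf{y}|\mathbf{x})$ and summing over all $\mathbf{x}$ yields (4.49). Let $\mathbf{x}$ be an arbitrary sequence in $\mathcal{T}_P$, i.e., $\mathbf{x}$ is a sequence with type $P$. The $(\varepsilon - \eta)$-information spectrum divergence in Feinstein's theorem can be bounded as

$$D_s^{\varepsilon-\eta}\big(P_{X^n} \times W^n \,\big\|\, P_{X^n} \times P_{X^n}W^n\big) = D_s^{\varepsilon-\eta}\big(W^n(\cdot|\mathbf{x}) \,\big\|\, P_{X^n}W^n\big) \tag{4.51}$$

$$\ge D_s^{\varepsilon-\eta}\big(W^n(\cdot|\mathbf{x}) \,\big\|\, (PW)^n\big) - \log|\mathcal{P}_n(\mathcal{X})| \tag{4.52}$$

$$\ge nI(P, W) + \sqrt{nV(P,W)}\,\Phi^{-1}\bigg(\varepsilon - \eta - \frac{6\,T(P,W)}{\sqrt{nV(P,W)^3}}\bigg) - \log|\mathcal{P}_n(\mathcal{X})|, \tag{4.53}$$

where (4.51) follows from permutation invariance within a type class, and the change of output measure step in (4.52) uses the bound in (4.49) as well as the consequence of the sifting property of $D_s^\varepsilon$ in (2.11). Inequality (4.53) uses the lower bound in the Berry-Esseen bound on $D_s^\varepsilon$ in Proposition 2.1 with $T(P, W) := \sum_x P(x)\,T(W(\cdot|x)\,\|\,PW)$. Choose $\eta$ in Feinstein's theorem to be $n^{-1/2}$. In view of (4.47), the following continuity properties hold for some $c_1, c_2 > 0$:

$$|I(P,W) - I(P_X^*,W)| \le c_1 n^{-2}, \quad\text{and} \tag{4.54}$$

$$\Big|\sqrt{V(P,W)} - \sqrt{V(P_X^*,W)}\Big| \le c_2 n^{-1}. \tag{4.55}$$

The bound in (4.54) follows because $P \mapsto I(P,W)$ behaves as a quadratic function near $P_X^*$, while (4.55) follows from the Lipschitz-ness of $P \mapsto \sqrt{V(P, W)}$ near $P_X^*$ because $V_\varepsilon(W) > 0$. Combining these bounds with the type counting lemma in (1.27) and a Taylor expansion of $\Phi^{-1}(\cdot)$ in (4.53) concludes the proof. □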
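The two combinatorial ingredients of this proof sketch, namely the approximation of a distribution by a type as in (4.47) and the type class bound in (4.50), can be checked numerically on small examples. The sketch below is ours: the helper names are hypothetical, the rounding rule (largest remainders) is just one convenient way to obtain a nearby type and is not claimed to be the construction used in the cited lemma, and entropies are in nats.

```python
from math import comb, exp, factorial, log


def nearest_type(p_star, n):
    """Round p_star to integer counts summing to n (largest-remainder rounding)."""
    counts = [int(n * p) for p in p_star]                 # floors, since p >= 0
    order = sorted(range(len(p_star)),
                   key=lambda a: n * p_star[a] - counts[a], reverse=True)
    for a in order[: n - sum(counts)]:                    # hand out the leftover mass
        counts[a] += 1
    return counts


def type_class_size(counts):
    """|T_P| as a multinomial coefficient."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)
    return size


def entropy(counts):
    """H(P) in nats for the type P with the given counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def number_of_types(n, alphabet_size):
    """|P_n(X)|, the number of types with denominator n."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)


if __name__ == "__main__":
    p_star, n = [0.5, 0.3, 0.2], 7
    counts = nearest_type(p_star, n)
    l1 = sum(abs(c / n - p) for c, p in zip(counts, p_star))
    print(counts, l1, len(p_star) / n)                    # compare with (4.47)
    lhs = 1 / type_class_size(counts)
    rhs = number_of_types(n, len(p_star)) * exp(-n * entropy(counts))
    print(lhs, rhs)                                       # compare with (4.50)
```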


We remark that if there are additive cost constraints on the codewords, the above proof goes through almost unchanged. The leading term in the asymptotic expansion in (4.46) would, of course, be the capacity-cost function [iUl Sec. 3.3]. The analogues of $V_{\min}(W)$ and $V_{\max}(W)$ that define the $\varepsilon$-dispersion (cf. (4.28)) would involve the maximum and minimum over the set of input distributions $P$ satisfying $\mathbb{E}_P[b(X)] \le \Gamma$. The third-order term remains unchanged. For more details, the reader is referred to [98].

In fact, the Gaussian approximation can be achieved with constant composition codes that are also partially universal. The only statistics of the DMC we need to know are the capacity and the $\varepsilon$-dispersion. The idea is to compare the empirical mutual information of a codeword and the channel output $\hat{I}(\mathbf{x}(m) \wedge \mathbf{y})$ to a threshold (that depends on capacity and dispersion), similar to maximum mutual information decoding [38, 62]. This technique was delineated in the proof of Theorem 3.2 for lossless source coding. Essentially, in channel coding, it uses the fact that if $X^n$ is uniform over the type class $\mathcal{T}_P$ and $Y^n$ is the corresponding channel output, the empirical mutual information $\hat{I}(X^n \wedge Y^n)$ satisfies the central limit relation

$$\sqrt{n}\,\big(\hat{I}(X^n \wedge Y^n) - I(P, W)\big) \xrightarrow{\mathrm{d}} \mathcal{N}\big(0, V(P, W)\big). \tag{4.56}$$

4.2.3 Converse Bounds: Asymptotic Expansions

The following are the strongest known asymptotic converse bounds.

Theorem 4.3. If the DMC $W$ satisfies $V_\varepsilon(W) > 0$,

$$\log M^*_{\mathrm{ave}}(W^n, \varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \tag{4.57}$$

If, in addition, the DMC is symmetric and singular,

$$\log M^*_{\mathrm{ave}}(W^n, \varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + O(1). \tag{4.58}$$


The claim in (4.57) is due to Tomamichel-Tan [164], and proved concurrently by Moulin [118], while (4.58) is due to Altug-Wagner [T^]. The case $V_\varepsilon(W) = 0$ was also treated in Tomamichel-Tan [164] but we focus on channels with $V_\varepsilon(W) > 0$. See [164, Fig. 1] for a summary of the best known upper bounds on $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ for all classes of DMCs (regardless of the positivity of $V_\varepsilon(W)$).

Theorem 4.3 implies that $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ is upper bounded by the Gaussian approximation $nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon)$ plus at most $\frac{1}{2}\log n + O(1)$. In general, this cannot be improved without further assumptions on the channel because it can be shown that the third-order term is $\frac{1}{2}\log n + O(1)$ for binary symmetric channels [123, Thm. 52]. In fact, for non-singular channels, Theorem 4.1 shows that $\frac{1}{2}\log n + O(1)$ is achievable in the third-order. The inequality in (4.57) improves on the results by Strassen [152, Thm. 1.2] and Polyanskiy-Poor-Verdú [123, Eq. (279)] who showed that the third-order term is upper bounded by $(|\mathcal{X}| - \frac{1}{2})\log n + O(1)$. The upper bound presented in (4.57) is independent of the input alphabet size $|\mathcal{X}|$.

Furthermore, under the stronger condition of symmetry and singularity, the third-order term can be tightened to $O(1)$. In view of the first part of Theorem 4.1, the third-order term of these channels is $O(1)$.

As the entire proof of Theorem 4.3 is rather lengthy, we will only provide a proof sketch of (4.57) for $V_{\min}(W) > 0$, highlighting the key features, including a novel construction of a net to approximate all output distributions. The following proof sketch is still fairly long, and the reader can skip it without any essential loss of continuity.


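Proof sketch of (4.57). We assume that $V_{\min}(W) > 0$. For DMCs, the bound in Proposition 4.4 evaluates to

$$\log M^*_{\mathrm{ave}}(W^n, \varepsilon) \le \min_{Q^{(n)} \in \mathcal{P}(\mathcal{Y}^n)}\ \max_{\mathbf{x} \in \mathcal{X}^n} D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) + \log\frac{1}{\eta}. \tag{4.59}$$

In the following, we choose $\eta = n^{-1/2}$ so the log term above gives our $\frac{1}{2}\log n$. It is thus important to find a suitable choice of $Q^{(n)} \in \mathcal{P}(\mathcal{Y}^n)$ to further upper bound the above. Symmetry considerations (see, e.g., [HU Sec. V]) allow us to restrict the search to distributions that are invariant under permutations of the $n$ channel uses.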

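Let $\zeta := |\mathcal{Y}|(|\mathcal{Y}| - 1)$ and let $\gamma > 0$. Consider the following convex combination of product distributions:

$$Q^{(n)}(\mathbf{y}) := \frac{1}{2}\sum_{\mathbf{k} \in \mathcal{K}} \frac{\exp(-\gamma\|\mathbf{k}\|_2^2)}{F}\prod_{i=1}^n Q_{\mathbf{k}}(y_i) + \frac{1}{2}\sum_{P_{\mathbf{x}} \in \mathcal{P}_n(\mathcal{X})} \frac{1}{|\mathcal{P}_n(\mathcal{X})|}\prod_{i=1}^n P_{\mathbf{x}}W(y_i), \tag{4.60}$$

where $F$ is a normalization constant that ensures $\sum_{\mathbf{y}} Q^{(n)}(\mathbf{y}) = 1$,

$$Q_{\mathbf{k}}(y) := Q^*(y) + \frac{k_y}{\sqrt{n\zeta}}, \tag{4.61}$$

and the index set $\mathcal{K}$ is defined as

$$\mathcal{K} := \Big\{\mathbf{k} = \{k_y\}_{y \in \mathcal{Y}} \in \mathbb{Z}^{|\mathcal{Y}|} : \sum_{y \in \mathcal{Y}} k_y = 0,\ k_y \ge -Q^*(y)\sqrt{n\zeta}\Big\}. \tag{4.62}$$

See Fig. 4.2. The convex combination of output distributions induced by input types $(P_{\mathbf{x}}W)^n$ and the optimal output distribution $(Q^*)^n$ (corresponding to $\mathbf{k} = \mathbf{0}$) in $Q^{(n)}$ is inspired partly by Hayashi [71, Thm. 2]. What we have done in our choice of $Q_{\mathbf{k}}$ is to uniformly quantize the simplex $\mathcal{P}(\mathcal{Y})$ along axis-parallel directions to form a net. The constraint that each $\mathbf{k}$ belongs to $\mathcal{K}$ ensures that each $Q_{\mathbf{k}}$ is a valid probability mass function. It can be shown that $F < \infty$. Furthermore, one can verify that for any $Q \in \mathcal{P}(\mathcal{Y})$, there exists a $\mathbf{k} \in \mathcal{K}$ such that

$$\|Q - Q_{\mathbf{k}}\|_2 \le \frac{1}{\sqrt{n}}, \tag{4.63}$$

so the net we have constructed is $\frac{1}{\sqrt{n}}$-dense in the $\ell_2$-norm metric.

Figure 4.2: Illustration of the choice of $\{Q_{\mathbf{k}}\}_{\mathbf{k} \in \mathcal{K}}$ for $\mathcal{Y} = \{0,1\}$. Note that all probability distributions lie on the line $Q(0) + Q(1) = 1$ and each element of the net is denoted by $Q_{\mathbf{k}}$ where $\mathbf{k}$ denotes some vector with integer elements.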
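For the binary output alphabet of Fig. 4.2, the net and its density property (4.63) are easy to check by hand or by machine. The sketch below is ours and covers only $|\mathcal{Y}| = 2$, where $\zeta = 2$ and $\mathbf{k} = (k, -k)$ for an integer $k$. The function name is hypothetical, and rounding $k$ toward zero is a choice we make so that $Q_{\mathbf{k}}$ always lies between $Q^*$ and $Q$ and is therefore a valid distribution; the general construction in the proof is not reproduced here.

```python
from math import sqrt, trunc


def nearest_net_point(q0, q_star0, n):
    """Net point Q_k near Q = (q0, 1 - q0) for a binary output alphabet.

    Q_k(0) = Q*(0) + k / sqrt(n * zeta) and Q_k(1) = Q*(1) - k / sqrt(n * zeta),
    with zeta = |Y| (|Y| - 1) = 2.
    """
    zeta = 2
    step = 1 / sqrt(n * zeta)
    k = trunc((q0 - q_star0) / step)     # round toward zero, so Q_k stays valid
    qk0 = q_star0 + k * step
    return k, (qk0, 1 - qk0)


def l2_distance(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    n, q_star0 = 50, 0.3
    worst = 0.0
    for i in range(1001):
        q = (i / 1000, 1 - i / 1000)
        _, qk = nearest_net_point(q[0], q_star0, n)
        worst = max(worst, l2_distance(q, qk))
    print(worst, 1 / sqrt(n))            # the worst case should not exceed 1/sqrt(n)
```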

Let us provide some intuition for the choice of $Q^{(n)}$. The first part of the convex combination is used to approximate output distributions induced by input types that are close to the set of CAIDs $\Pi$. We choose a weight for each element of the net that drops exponentially with the distance from the unique CAOD. This ensures that the normalization $F$ does not depend on $n$ even though the number of elements in the net increases with $n$. The smaller weights for types far from the CAIDs will later be compensated by the larger deviation of the corresponding mutual information from the capacity. This is achieved by the second part of the convex combination which we use to match the input types far from the CAIDs. This partition of input types into those that are close and far from $\Pi$ was also used by Strassen [152] in his proof of the second-order asymptotics for DMCs.

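Now we just have to evaluate $D_s^{\varepsilon+\eta}(W^n(\cdot|\mathbf{x})\,\|\,Q^{(n)})$ for all $\mathbf{x} \in \mathcal{X}^n$. The idea is to partition input sequences depending on the distance of their types from the set of CAIDs. For this define

$$\Pi_\mu := \Big\{P \in \mathcal{P}(\mathcal{X}) : \min_{P^* \in \Pi}\|P - P^*\|_2 \le \mu\Big\} \tag{4.64}$$

for some small $\mu > 0$. The choice of $\mu$ will be made later.

For sequences whose types $P_{\mathbf{x}}$ are not in $\Pi_\mu$, we pick $(P_{\mathbf{x}}W)^n$ from the convex combination (per Lemma 2.2), giving

$$D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) \le D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,(P_{\mathbf{x}}W)^n\big) + \log\big(2|\mathcal{P}_n(\mathcal{X})|\big). \tag{4.65}$$

Next the Chebyshev type bound in Proposition 2.2 yields

$$D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) \le nI(P_{\mathbf{x}}, W) + \sqrt{\frac{nV(P_{\mathbf{x}}, W)}{1 - \varepsilon - \eta}} + \log\big(2|\mathcal{P}_n(\mathcal{X})|\big). \tag{4.66}$$

Since $I(P_{\mathbf{x}}, W) \le C' < C$ (i.e., the first-order mutual information term is strictly bounded away from capacity), $V(P_{\mathbf{x}}, W)$ is uniformly bounded [H71 Rmk. 3.1.1] and the number of types is polynomial, the right-hand side of the preceding inequality is upper bounded by $nC' + O(\sqrt{n})$. This is smaller than the Gaussian approximation for all sufficiently large $n$ as $C' < C$.

Now for sequences whose types are in $\Pi_\mu$, we pick $Q_{\mathbf{k}(\mathbf{x})}$ from the net that is closest to $P_{\mathbf{x}}W$. Per Lemma 2.2 this gives

$$D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) \le D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q_{\mathbf{k}(\mathbf{x})}^n\big) + \gamma\|\mathbf{k}(\mathbf{x})\|_2^2 + \log(2F). \tag{4.67}$$

By the Berry-Esseen-type bound in Proposition 2.1, we have

$$D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) \le nD\big(W\,\|\,Q_{\mathbf{k}(\mathbf{x})}\,|\,P_{\mathbf{x}}\big) + \sqrt{nV\big(W\,\|\,Q_{\mathbf{k}(\mathbf{x})}\,|\,P_{\mathbf{x}}\big)}\,\Phi^{-1}\Big(\varepsilon + \eta + \frac{\kappa}{\sqrt{n}}\Big) + \gamma\|\mathbf{k}(\mathbf{x})\|_2^2 + \log(2F) \tag{4.68}$$

for some finite $\kappa > 0$. By the $\frac{1}{\sqrt{n}}$-denseness of the net, the positivity of the CAOD, and the bound $D(Q\,\|\,\tilde{Q}) \le \|Q - \tilde{Q}\|_2^2 / \min_z \tilde{Q}(z)$ [30, Lem. 6.3], we can show that there exists a constant $q > 0$ such that

$$D\big(W\,\|\,Q_{\mathbf{k}(\mathbf{x})}\,|\,P_{\mathbf{x}}\big) \le I(P_{\mathbf{x}}, W) + \frac{\|P_{\mathbf{x}}W - Q_{\mathbf{k}(\mathbf{x})}\|_2^2}{q} \le I(P_{\mathbf{x}}, W) + \frac{1}{nq}. \tag{4.69}$$

Furthermore, by the Lipschitz-ness of $Q \mapsto \sqrt{V(W\,\|\,Q\,|\,P)}$, which follows from the fact that $Q^*(y) > 0$ for all $y \in \mathcal{Y}$, we have

$$\Big|\sqrt{V\big(W\,\|\,Q_{\mathbf{k}(\mathbf{x})}\,|\,P_{\mathbf{x}}\big)} - \sqrt{V(P_{\mathbf{x}}, W)}\Big| \le \beta\,\|P_{\mathbf{x}}W - Q_{\mathbf{k}(\mathbf{x})}\|_2 \le \frac{\beta}{\sqrt{n}}. \tag{4.70}$$

It is known from Strassen's work [152, Eq. (4.41)] and continuity considerations that for all $P_{\mathbf{x}} \in \Pi_\mu$,

$$I(P_{\mathbf{x}}, W) \le C - \alpha\xi^2 \quad\text{and}\quad \Big|\sqrt{V(P_{\mathbf{x}}, W)} - \sqrt{V(P^*, W)}\Big| \le \beta\xi, \tag{4.71}$$

where $P^*$ is the closest element in $\Pi$ to $P_{\mathbf{x}}$ and $\xi$ is the corresponding Euclidean distance. Let $\|W\|_2$ be the spectral norm of $W$. By the construction of the net,

$$\|\mathbf{k}(\mathbf{x})\|_2 \le \sqrt{n\zeta}\,\|Q_{\mathbf{k}(\mathbf{x})} - Q^*\|_2 \tag{4.72}$$

$$\le \sqrt{n\zeta}\,\big(\|Q_{\mathbf{k}(\mathbf{x})} - P_{\mathbf{x}}W\|_2 + \|P_{\mathbf{x}}W - Q^*\|_2\big) \tag{4.73}$$

$$\le \sqrt{\zeta}\,\big(1 + \sqrt{n}\,\|W\|_2\,\xi\big). \tag{4.74}$$

Uniting (4.68), (4.69), (4.70) and (4.74) and using some simple algebra completes the proof. □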


As can be seen from the above proof, the net serves to approximate all possible output distributions so that, together with standard continuity arguments concerning information quantities, the remainder terms resulting from (4.69), (4.70) and (4.74) are all $O(1)$.

If we had chosen the more "natural" output distribution

$$Q^{(n)}(\mathbf{y}) = \sum_{P_{\mathbf{x}} \in \mathcal{P}_n(\mathcal{X})} \frac{1}{|\mathcal{P}_n(\mathcal{X})|}\prod_{i=1}^n P_{\mathbf{x}}W(y_i) \tag{4.75}$$

in place of $Q^{(n)}$ in (4.60), an application of Lemma 2.2, the type counting lemma in (1.27), and continuity arguments show that the third-order term would be $(|\mathcal{X}| - \frac{1}{2})\log n + O(1)$. This upper bound on the third-order term was shown in the works by Strassen [152, Thm. 1.2] and Polyanskiy-Poor-Verdú [123, Eq. (279)]. The choice of output distribution in (4.75) is essentially due to Hayashi [71].
4.3 Asymptotic Expansions for Gaussian Channels

In this section, we consider discrete-time additive white Gaussian noise (AWGN) channels in which

$$Y_i = X_i + Z_i, \tag{4.76}$$

for each time $i = 1, \ldots, n$. The noise $\{Z_i\}_{i=1}^n$ is a memoryless, stationary Gaussian process with zero mean and unit variance so the channel can be expressed as

$$W(y|x) = \mathcal{N}(y; x, 1) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{(y-x)^2}{2}\Big). \tag{4.77}$$

This is perhaps the most important and well-studied channel in communication systems. In the case of Gaussian channels, we must impose a cost constraint on the codewords, namely for every $m$,

$$\|f(m)\|_2^2 = \sum_{i=1}^n f_i(m)^2 \le n\,\mathrm{snr}, \tag{4.78}$$

where $n$ is the blocklength, $\mathrm{snr}$ is the admissible power and $f_i(m)$ is the $i$-th coordinate of the $m$-th codeword. The signal-to-noise ratio is thus $\mathrm{snr}$. We use the notation $M^*_{\mathrm{ave}}(W^n, \mathrm{snr}, \varepsilon)$ to mean the maximum number of codewords transmissible over $W^n$ with average error probability and signal-to-noise ratio not exceeding $\varepsilon \in (0,1)$ and $\mathrm{snr}$ respectively. We define $M^*_{\max}(W^n, \mathrm{snr}, \varepsilon)$ in an analogous fashion.

Define the Gaussian capacity and Gaussian dispersion functions as

$$\mathsf{C}(\mathrm{snr}) := \frac{1}{2}\log(1 + \mathrm{snr}), \quad\text{and}\quad \mathsf{V}(\mathrm{snr}) := \log^2 e \cdot \frac{\mathrm{snr}(\mathrm{snr} + 2)}{2(\mathrm{snr} + 1)^2} \tag{4.79}$$

respectively. The direct part of the following theorem was proved in Tan-Tomamichel [159] and the converse in Polyanskiy-Poor-Verdú [123, Thm. 54]. The second-order asymptotics (ignoring the third-order term) was proved concurrently with [123] by Hayashi [71, Thm. 5]. Hayashi showed the direct part using the second-order asymptotics for DMCs with cost constraints (similar to Theorem 4.2) and a quantization argument (also see [153]). The converse part was shown using the Hayashi-Nagaoka converse bound in (4.11) with the output distribution chosen to be the product CAOD.

Theorem 4.4. For every $\mathrm{snr} \in (0, \infty)$,

$$\log M^*_{\mathrm{ave}}(W^n, \mathrm{snr}, \varepsilon) = n\mathsf{C}(\mathrm{snr}) + \sqrt{n\mathsf{V}(\mathrm{snr})}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \tag{4.80}$$

For the AWGN channel, we see that the asymptotic expansion is known exactly up to the third order under the average error setting. The converse proof (upper bound of (4.80)) is simple and uses a specialization of Proposition 4.4 with the product CAOD.
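As a numerical illustration, the sketch below evaluates the three explicit terms of (4.80), leaving out the unspecified $O(1)$ term. The function names are ours, and we assume natural logarithms throughout, so that $\log^2 e = 1$ in (4.79) and all quantities are in nats.

```python
from math import log, sqrt
from statistics import NormalDist


def gaussian_capacity(snr):
    """C(snr) = (1/2) log(1 + snr), in nats."""
    return 0.5 * log(1 + snr)


def gaussian_dispersion(snr):
    """V(snr) = snr (snr + 2) / (2 (snr + 1)^2), in nats squared."""
    return snr * (snr + 2) / (2 * (snr + 1) ** 2)


def awgn_normal_approximation(n, snr, eps):
    """Right-hand side of (4.80) without the O(1) term."""
    return (n * gaussian_capacity(snr)
            + sqrt(n * gaussian_dispersion(snr)) * NormalDist().inv_cdf(eps)
            + 0.5 * log(n))


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        # Approximate maximal rate in nats per channel use at snr = 1, eps = 1e-3.
        print(n, awgn_normal_approximation(n, 1.0, 1e-3) / n)
```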

The achievability proof is, however, more involved and uses the RCU bound and Laplace's technique for approximating high-dimensional integrals [150, 162]. The main step establishes that if $X^n$ is uniform on the power sphere $\{\mathbf{x} : \|\mathbf{x}\|_2^2 = n\,\mathrm{snr}\}$, one has

$$\Pr\big(\langle X^n, Y^n\rangle \in [b, b + \mu] \,\big|\, Y^n = \mathbf{y}\big) \le \kappa\cdot\frac{\mu}{\sqrt{n}}, \tag{4.81}$$

where $\kappa$ does not depend on $b \in \mathbb{R}$ and typical $\mathbf{y}$, i.e., $\mathbf{y}$ such that $\|\mathbf{y}\|_2^2 \approx n(\mathrm{snr} + 1)$. The estimate in (4.81) is not obvious as the inner product $\langle X^n, Y^n\rangle$ is not a sum of independent random variables and so standard limit theorems (like those in Section 1.5) cannot be employed directly. The division by $\sqrt{n}$ gives us the $\frac{1}{2}\log n$ beyond the Gaussian approximation.

If one is content with just the Gaussian approximation with an $O(1)$ third-order term, one can evaluate the so-called $\kappa\beta$-bound [123, Thm. 25]. See [123, Thm. 54] for the justification. The reader is also referred to MolavianJazi-Laneman [112] for an elegant proof strategy using the central limit theorem for functions (Theorem 1.5) to prove the achievability part of Theorem 4.4 under the average error setting with an $O(1)$ third-order term.

It remains an open question whether $\frac{1}{2}\log n + O(1)$ is achievable under the maximum error setting, i.e., whether $\log M^*_{\max}(W^n, \mathrm{snr}, \varepsilon)$ is lower bounded by the expansion in (4.80).


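Proof. We start with the converse. By appending to a length-$n$ codeword (possibly of power strictly less than $\mathrm{snr}$) an extra $(n+1)$-st coordinate to equalize powers [123, Lem. 39] [145, Sec. X] (known as the $n \to n+1$ argument or the Yaglom map trick [28, Ch. 9, Thm. 6]), we have that

$$M^*_{\mathrm{ave}}(W^n, \mathrm{snr}, \varepsilon) \le M^*_{\mathrm{eq}}(W^{n+1}, \mathrm{snr}, \varepsilon), \tag{4.82}$$

where $M^*_{\mathrm{eq}}(W^n, \mathrm{snr}, \varepsilon)$ is similar to $M^*_{\mathrm{ave}}(W^n, \mathrm{snr}, \varepsilon)$, except that the codewords must satisfy the cost constraints with equality, i.e., $\|f(m)\|_2^2 = \|\mathbf{x}(m)\|_2^2 = n\,\mathrm{snr}$. Since increasing the blocklength by 1 does not affect the asymptotics of $\log M^*_{\mathrm{eq}}(W^n, \mathrm{snr}, \varepsilon)$, we may as well assume that all codewords satisfy the cost constraints with equality. By Proposition 4.4 applied to $n$ uses of the AWGN channel, we have

$$\log M^*_{\mathrm{eq}}(W^n, \mathrm{snr}, \varepsilon) \le \inf_{Q^{(n)}}\ \sup_{\mathbf{x}:\,\|\mathbf{x}\|_2^2 = n\,\mathrm{snr}} D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) + \log\frac{1}{\eta}. \tag{4.83}$$

Take $\eta = \frac{1}{\sqrt{n}}$ so the final log term gives $\frac{1}{2}\log n$. It remains to show that the $(\varepsilon + \eta)$-information spectrum divergence term is upper bounded by the Gaussian approximation plus at most a constant term.

For this purpose, we have to choose the output distribution $Q^{(n)} \in \mathcal{P}(\mathbb{R}^n)$. This choice is easy compared to the DMC case. We choose

$$Q^{(n)}(\mathbf{y}) = \prod_{i=1}^n Q_Y^*(y_i), \quad\text{where}\quad Q_Y^*(y) = \mathcal{N}(y; 0, 1 + \mathrm{snr}). \tag{4.84}$$

One can then check that for every $\mathbf{x} \in \mathbb{R}^n$ such that $\|\mathbf{x}\|_2^2 = n\,\mathrm{snr}$,

$$\mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \log\frac{W(Y_i|x_i)}{Q_Y^*(Y_i)}\bigg] = \mathsf{C}(\mathrm{snr}), \quad\text{and} \tag{4.85}$$

$$\operatorname{Var}\bigg[\frac{1}{\sqrt{n}}\sum_{i=1}^n \log\frac{W(Y_i|x_i)}{Q_Y^*(Y_i)}\bigg] = \mathsf{V}(\mathrm{snr}). \tag{4.86}$$

Then, by the Berry-Esseen-type bound in Proposition 2.1, we have

$$D_s^{\varepsilon+\eta}\big(W^n(\cdot|\mathbf{x})\,\big\|\,Q^{(n)}\big) \le n\mathsf{C}(\mathrm{snr}) + \sqrt{n\mathsf{V}(\mathrm{snr})}\,\Phi^{-1}\bigg(\varepsilon + \eta + \frac{6T}{\sqrt{n\mathsf{V}(\mathrm{snr})^3}}\bigg), \tag{4.87}$$

where $T < \infty$ is related to the third absolute moments of $\log\frac{W(Y_i|x_i)}{Q_Y^*(Y_i)}$. A Taylor expansion of $\Phi^{-1}(\cdot)$ concludes the proof of the converse.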
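The statistics in (4.85) and (4.86) can be verified without simulation. Writing $Y_i = x_i + Z_i$ and working in nats, a direct calculation (ours, not taken from the source) gives a per-coordinate mean of $\frac{1}{2}\log(1+\mathrm{snr}) + \frac{x_i^2 - \mathrm{snr}}{2(1+\mathrm{snr})}$ and a per-coordinate variance of $\frac{\mathrm{snr}^2 + 2x_i^2}{2(1+\mathrm{snr})^2}$. The sketch below, with hypothetical function names, sums these over two very different codewords on the power sphere and shows that the totals depend on $\mathbf{x}$ only through $\|\mathbf{x}\|_2^2$.

```python
from math import log, sqrt


def llr_moments(x, snr):
    """Mean and variance (nats) of sum_i log W(Y_i|x_i)/Q*_Y(Y_i), with Y_i = x_i + Z_i.

    Uses closed-form per-coordinate moments derived for Z_i ~ N(0, 1)
    and Q*_Y = N(0, 1 + snr).
    """
    mean = 0.0
    var = 0.0
    for xi in x:
        mean += 0.5 * log(1 + snr) + (xi ** 2 - snr) / (2 * (1 + snr))
        var += (snr ** 2 + 2 * xi ** 2) / (2 * (1 + snr) ** 2)
    return mean, var


if __name__ == "__main__":
    n, snr = 100, 2.0
    flat = [sqrt(snr)] * n                         # power spread evenly
    peaky = [sqrt(n * snr)] + [0.0] * (n - 1)      # all power in one coordinate
    C = 0.5 * log(1 + snr)
    V = snr * (snr + 2) / (2 * (snr + 1) ** 2)
    print(llr_moments(flat, snr), llr_moments(peaky, snr), (n * C, n * V))
```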

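Since the proof of the direct part is long, we only highlight some key ideas in the following steps. Details can be found in [159].

Step 1: (Random coding distribution) Consider the following input distribution to be applied to the RCU bound:

$$P_{X^n}(\mathbf{x}) = \frac{\delta\{\|\mathbf{x}\|_2^2 - n\,\mathrm{snr}\}}{A_n(\sqrt{n\,\mathrm{snr}})}, \tag{4.88}$$

where $\delta\{\cdot\}$ is the Dirac delta and $A_n(r) = \frac{2\pi^{n/2}}{\Gamma(n/2)}\,r^{n-1}$ is the surface area of a sphere of radius $r$ in $\mathbb{R}^n$. The power constraints are automatically satisfied with probability one. Let

$$q(\mathbf{x}, \mathbf{y}) := \log\frac{W^n(\mathbf{y}|\mathbf{x})}{P_{X^n}W^n(\mathbf{y})} \tag{4.89}$$

be the log-likelihood ratio. We will take advantage of the fact that

$$q(\mathbf{x}, \mathbf{y}) = \frac{n}{2}\log\frac{1}{2\pi} + \langle\mathbf{x}, \mathbf{y}\rangle - \frac{1}{2}\big(n\,\mathrm{snr} + \|\mathbf{y}\|_2^2\big) - \log P_{X^n}W^n(\mathbf{y}) \tag{4.90}$$

only depends on the codeword through the inner product $\langle\mathbf{x}, \mathbf{y}\rangle = \sum_{i=1}^n x_i y_i$. In fact, $q(\mathbf{x}, \mathbf{y})$ is equal to $\langle\mathbf{x}, \mathbf{y}\rangle$ up to a shift that only depends on $\|\mathbf{y}\|_2^2$.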

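Step 2: (RCU bound) The RCU bound (Proposition 4.3) states that there exists a blocklength-$n$ code with $M$ codewords and average error probability $\varepsilon'$ such that

$$\varepsilon' \le \mathbb{E}\big[\min\big\{1, M\Pr\big(q(\bar{X}^n, Y^n) \ge q(X^n, Y^n) \,\big|\, X^n, Y^n\big)\big\}\big], \tag{4.91}$$

where $(X^n, \bar{X}^n, Y^n) \sim P_{X^n}(\mathbf{x})P_{X^n}(\bar{\mathbf{x}})W^n(\mathbf{y}|\mathbf{x})$. Let

$$g(t, \mathbf{y}) := \Pr\big(q(\bar{X}^n, Y^n) \ge t \,\big|\, Y^n = \mathbf{y}\big) \tag{4.92}$$

so the probability in (4.91) can be written as

$$\Pr\big(q(\bar{X}^n, Y^n) \ge q(X^n, Y^n) \,\big|\, X^n, Y^n\big) = g\big(q(X^n, Y^n), Y^n\big). \tag{4.93}$$

By using Bayes rule, we see that

$$g(t, \mathbf{y}) = \mathbb{E}\big[\exp(-q(X^n, Y^n))\,\mathbb{1}\{q(X^n, Y^n) \ge t\} \,\big|\, Y^n = \mathbf{y}\big]. \tag{4.94}$$

Step 3: (A high-probability set) Now, we define a set of channel outputs with high probability

$$\mathcal{T} := \Big\{\mathbf{y} : \frac{1}{n}\|\mathbf{y}\|_2^2 \in [\mathrm{snr} + 1 - \delta_n, \mathrm{snr} + 1 + \delta_n]\Big\}. \tag{4.95}$$

With $\delta_n = n^{-1/3}$, it is easy to show that $P_{X^n}W^n(\mathcal{T}) \ge 1 - \xi_n$ where $\xi_n = \exp(-\Theta(n^{1/3}))$.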

Step 4: (Probability of the log-likelihood ratio belonging to an interval) We would like to upper bound $g(t, \mathbf{y})$ in (4.92) to evaluate the RCU bound. As an intermediate step, we consider estimating

$$p(a, \mu \,|\, \mathbf{y}) := \Pr\big(q(X^n, Y^n) \in [a, a + \mu] \,\big|\, Y^n = \mathbf{y}\big), \tag{4.96}$$

where $a \in \mathbb{R}$ and $\mu > 0$ are some constants. Because $Y^n$ is fixed to some constant vector $\mathbf{y}$ and $\|X^n\|_2^2$ is also constant, $p(a, \mu \,|\, \mathbf{y})$ can be rewritten using (4.90) as

$$p(a, \mu \,|\, \mathbf{y}) = \Pr\big(\langle X^n, Y^n\rangle \in [b, b + \mu] \,\big|\, Y^n = \mathbf{y}\big), \tag{4.97}$$

for some other constant $b$ that depends on $a$. So the crux of the proof boils down to understanding the behavior of the inner product $\langle X^n, Y^n\rangle = \sum_{i=1}^n X_i Y_i$ under the input distribution in (4.88). The following important estimate is shown in [159] using Laplace approximation for integrals [159, 162].

Lemma 4.1. For all large enough $n$ (depending only on $\mathrm{snr}$), all $\mathbf{y} \in \mathcal{T}$ and all $a \in \mathbb{R}$,

$$p(a, \mu \,|\, \mathbf{y}) \le \kappa\cdot\frac{\mu}{\sqrt{n}}, \tag{4.98}$$

where $\kappa > 0$ also depends only on the power $\mathrm{snr}$.

Step 5: (Probability that the decoding metric exceeds $t$ for an incorrect codeword) We now return to bounding $g(t, \mathbf{y})$ in (4.92). Again, we assume $\mathbf{y} \in \mathcal{T}$. The idea here is to consider the second form of $g(t, \mathbf{y})$ in (4.94) and to slice the interval $[t, \infty)$ into non-overlapping segments $\{[t + l\mu, t + (l+1)\mu) : l \in \mathbb{N} \cup \{0\}\}$ where $\mu > 0$ is a constant. Then we apply Lemma 4.1 to each segment. This is modeled on the proof of Theorem 1.3. Carrying out the calculations, we have

$$g(t, \mathbf{y}) \le \sum_{l=0}^\infty \exp(-t - l\mu)\,p(t + l\mu, \mu \,|\, \mathbf{y}) \tag{4.99}$$

$$\le \sum_{l=0}^\infty \exp(-t - l\mu)\cdot\kappa\cdot\frac{\mu}{\sqrt{n}} \tag{4.100}$$

$$= \frac{\exp(-t)}{1 - \exp(-\mu)}\cdot\frac{\kappa\cdot\mu}{\sqrt{n}}. \tag{4.101}$$

Since $\mu > 0$ is a free parameter, we may choose it to be $\log 2$, yielding

$$g(t, \mathbf{y}) \le \frac{\gamma\exp(-t)}{\sqrt{n}}, \tag{4.102}$$

where $\gamma := 2\kappa\log 2$.
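The passage from (4.100) to (4.102) is a geometric series, which the following sketch checks numerically. The function names are ours, the value of $\kappa$ is a placeholder (the constant of Lemma 4.1 is not computed in this text), and the infinite sum is truncated at a finite number of terms.

```python
from math import exp, log, sqrt


def sliced_bound(t, n, kappa, mu, terms=200):
    """Truncated version of the sum in (4.100)."""
    return sum(exp(-t - l * mu) * kappa * mu / sqrt(n) for l in range(terms))


def closed_form(t, n, kappa, mu):
    """Closed form (4.101) of the geometric series."""
    return exp(-t) / (1 - exp(-mu)) * kappa * mu / sqrt(n)


if __name__ == "__main__":
    t, n, kappa = 1.0, 100, 3.0            # kappa = 3.0 is an arbitrary placeholder
    mu = log(2)
    gamma = 2 * kappa * log(2)             # constant appearing in (4.102)
    print(sliced_bound(t, n, kappa, mu), closed_form(t, n, kappa, mu),
          gamma * exp(-t) / sqrt(n))
```

Any fixed $\mu > 0$ would do here, since it only changes the constant; $\mu = \log 2$ is convenient because $1 - \exp(-\mu) = \frac{1}{2}$.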


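Step 6: (Evaluation of RCU) We now have all the necessary ingredients to evaluate the RCU bound in (4.91). Consider,

$$\varepsilon' \le \mathbb{E}\big[\min\big\{1, M g(q(X^n, Y^n), Y^n)\big\}\big] \tag{4.103}$$

$$\le \Pr(Y^n \in \mathcal{T}^c) + \mathbb{E}\big[\min\big\{1, M g(q(X^n, Y^n), Y^n)\big\} \,\big|\, Y^n \in \mathcal{T}\big]\Pr(Y^n \in \mathcal{T}). \tag{4.104}$$

The first term is bounded above by $\xi_n$ and the second can be bounded above by

$$\mathbb{E}\bigg[\min\bigg\{1, \frac{M\gamma\exp(-q(X^n, Y^n))}{\sqrt{n}}\bigg\} \,\bigg|\, Y^n \in \mathcal{T}\bigg]\Pr(Y^n \in \mathcal{T}) \tag{4.105}$$

due to (4.102) with $t = q(X^n, Y^n)$. We split the expectation into two parts depending on whether $q(\mathbf{x}, \mathbf{y}) > \log(M\gamma/\sqrt{n})$ or otherwise, i.e.,

$$\mathbb{E}\bigg[\min\bigg\{1, \frac{M\gamma\exp(-q(X^n, Y^n))}{\sqrt{n}}\bigg\} \,\bigg|\, Y^n \in \mathcal{T}\bigg] \tag{4.106}$$

$$\le \Pr\bigg(q(X^n, Y^n) \le \log\frac{M\gamma}{\sqrt{n}} \,\bigg|\, Y^n \in \mathcal{T}\bigg) + \frac{M\gamma}{\sqrt{n}}\,\mathbb{E}\bigg[\mathbb{1}\bigg\{q(X^n, Y^n) > \log\frac{M\gamma}{\sqrt{n}}\bigg\}\exp(-q(X^n, Y^n)) \,\bigg|\, Y^n \in \mathcal{T}\bigg]. \tag{4.107}$$

By applying (4.102) with $t = \log(M\gamma/\sqrt{n})$, we know that the second term can be bounded above by $\gamma/\sqrt{n}$.

Now let $Q_Y^*(y) = \mathcal{N}(y; 0, \mathrm{snr} + 1)$ be the CAOD and $Q_{Y^n}^*(\mathbf{y}) = \prod_{i=1}^n Q_Y^*(y_i)$ its $n$-fold memoryless extension. In Step 1 of the proof of Lem. 61 in [123], Polyanskiy-Poor-Verdú showed that there exists a finite constant $\zeta > 0$ such that

$$\sup_{\mathbf{y} \in \mathcal{T}} \frac{P_{X^n}W^n(\mathbf{y})}{Q_{Y^n}^*(\mathbf{y})} \le \zeta. \tag{4.108}$$

Thus, the first probability in (4.107) multiplied by $\Pr(Y^n \in \mathcal{T})$ can be upper bounded using the Berry-Esseen theorem and the statistics in (4.85)-(4.86) by

$$\Pr\bigg(\log\frac{W^n(Y^n|X^n)}{Q_{Y^n}^*(Y^n)} \le \log\frac{M\gamma\zeta}{\sqrt{n}}\bigg) \le \Phi\Bigg(\frac{\log\frac{M\gamma\zeta}{\sqrt{n}} - n\mathsf{C}(\mathrm{snr})}{\sqrt{n\mathsf{V}(\mathrm{snr})}}\Bigg) + \frac{\beta}{\sqrt{n}}, \tag{4.109}$$

where $\beta$ is a finite positive constant that depends only on $\mathrm{snr}$.

Putting all the bounds together, we obtain

$$\varepsilon' \le \Phi\Bigg(\frac{\log\frac{M\gamma\zeta}{\sqrt{n}} - n\mathsf{C}(\mathrm{snr})}{\sqrt{n\mathsf{V}(\mathrm{snr})}}\Bigg) + \frac{\beta}{\sqrt{n}} + \frac{\gamma}{\sqrt{n}} + \xi_n. \tag{4.110}$$

Now choose $M$ to be the largest integer satisfying

$$\log M \le n\mathsf{C}(\mathrm{snr}) + \sqrt{n\mathsf{V}(\mathrm{snr})}\,\Phi^{-1}\bigg(\varepsilon - \frac{\beta + \gamma}{\sqrt{n}} - \xi_n\bigg) + \frac{1}{2}\log n - \log(\gamma\zeta). \tag{4.111}$$

This choice ensures that $\varepsilon' \le \varepsilon$. By a Taylor expansion of $\Phi^{-1}(\cdot)$, this completes the proof of the lower bound in (4.80). □

Channel                     Third-Order Term              Prefactor $\varrho_n$

Non-singular, Symm. DMC     $\frac{1}{2}\log n + O(1)$    $\Theta\big(1/n^{(1+|E'(R)|)/2}\big)$
Singular, Symm. DMC         $O(1)$                        $\Theta\big(1/n^{1/2}\big)$
AWGN                        $\frac{1}{2}\log n + O(1)$    $\Theta\big(1/n^{(1+|E'(R)|)/2}\big)$

Table 4.2: Comparison between the third-order term in the normal approximation and prefactors $\varrho_n$ in the error exponents regime for various classes of channels. The reliability function [MiEaiTi] is denoted as $E(R)$ and its derivative (if it exists) is $E'(R)$. For the first row of the table, symmetry is not required for the third-order term to be equal to $\frac{1}{2}\log n + O(1)$ (cf. (4.33) and (4.57)).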

4.4 A Digression: Third-Order Asymptotics vs Error Exponent Prefactors

We conclude our discussion on fixed error asymptotics for channel coding with a final remark. We have seen from Theorems 4.1 and 4.3 that the third-order term in the normal approximation for DMCs is given by $\frac{1}{2}\log n + O(1)$ (resp. $O(1)$) for non-singular channels (resp. singular, symmetric channels). We have also seen from Theorem 4.4 that the third-order term for AWGN channels is $\frac{1}{2}\log n + O(1)$. These results are summarized in Table 4.2.

In another line of study, Altug-Wagner [nniiii] and Scarlett-Martinez-Guillén i Fàbregas [135] derived prefactors in the error exponents regime for DMCs. In a nutshell, the authors were concerned with finding a sequence $\varrho_n$ such that, for high rates (i.e., rates above the critical rate),†

$$\varepsilon^*(W^n, \lfloor\exp(nR)\rfloor) \sim \varrho_n\cdot\exp(-nE(R)), \tag{4.112}$$

where $\varepsilon^*(W^n, M)$ is the smallest average error probability of a code for the channel $W^n$ with $M$ codewords, and $E(R)$ is the reliability function (or error exponent) of the channel [39l ESI liSj]. The results are also summarized in Table 4.2. For the AWGN channel, it can be verified from Shannon's work on the error exponents for the AWGN channel [145] that the prefactor is the same as that for non-singular, symmetric DMCs. Also see the work by Wiechman and Sason [UTS]. Table 4.2 suggests that there is a correspondence between third-order terms and prefactors. A precise relation between these two fundamental quantities is an interesting avenue for future research.

†We recall from Section 1.3.1 that $a_n \sim b_n$ iff $a_n/b_n \to 1$ as $n \to \infty$.
Figure 4.3: Illustration of the joint source-channel coding problem.


4.5 Joint Source-Channel Coding 


We conclude our discussion on channel coding by putting together the results and techniques presented in this and the previous chapter on (lossy and lossless) source coding. We consider the fundamental problem of transmitting a memoryless source over a memoryless channel as shown in Fig. 4.3. Shannon showed [141, 144] that as long as

$$\limsup_{n \to \infty} \frac{k_n}{n} < \frac{C(W)}{R(P, \Delta)}, \tag{4.113}$$

where $k_n$ is the number of independent source symbols from $P$ and $n$ is the number of channel uses, the probability of excess distortion can be arbitrarily small in the limit of large blocklengths. The ratio $k_n/n$ is also known as the bandwidth expansion ratio. We summarize known fixed error probability-type results on source-channel transmission in this section.

The source-channel transmission problem is formally defined as follows: A $(d, \Delta, \varepsilon)$-code for source $S$ with distribution $P \in \mathcal{P}(\mathcal{S})$ over the channel $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ is a pair of maps including an encoder $f : \mathcal{S} \to \mathcal{X}$ and a decoder $\varphi : \mathcal{Y} \to \hat{\mathcal{S}}$ such that the probability of excess distortion satisfies

$$\sum_{s \in \mathcal{S}} P(s)\,W\big(\{y : d(s, \varphi(y)) > \Delta\} \,\big|\, f(s)\big) \le \varepsilon. \tag{4.114}$$

Again we assume there are no cost constraints on the channel inputs to simplify the exposition. If there are cost constraints, a natural coding strategy would involve constant composition codes as discussed in Theorem 4.2.

In the conventional fixed-to-fixed length setting in which $\mathcal{X}$ and $\mathcal{Y}$ are $n$-fold Cartesian products of the input and output alphabets respectively and $\mathcal{S}$ is the $k$-fold Cartesian product of the source alphabet, we may define the following: A $(k, n, d^{(k)}, \Delta, \varepsilon)$-code is simply a $(d^{(k)}, \Delta, \varepsilon)$-code for the source with distribution $P^k \in \mathcal{P}(\mathcal{S}^k)$ and over the channel $W^n \in \mathcal{P}(\mathcal{Y}^n|\mathcal{X}^n)$ such that the probability of excess distortion measured according to $d^{(k)}$ is no greater than $\varepsilon$.

The source-channel non-asymptotic fundamental limit we are interested in is defined as follows:

$$k^*(n, d^{(k)}, \Delta, \varepsilon) := \max\big\{k \in \mathbb{N} : \exists\ \text{a}\ (k, n, d^{(k)}, \Delta, \varepsilon)\text{-code for}\ (P^k, W^n)\big\}. \tag{4.115}$$

This represents the maximum number of source symbols transmissible over the channel $W^n$ such that the probability of excess distortion (at distortion level $\Delta$) does not exceed $\varepsilon$. One is also interested in the maximum joint source-channel coding rate, which is the ratio between the number of source symbols and the number of channel uses, i.e.,

$$R^*(n, d^{(k)}, \Delta, \varepsilon) := \frac{k^*(n, d^{(k)}, \Delta, \varepsilon)}{n}. \tag{4.116}$$
4.5.1 Asymptotic Expansion 

The main result of this section was proved independently by Kostina-Verdú [53] and Wang-Ingber-Kochman [170] (for the special case of transmitting DMSes over DMCs).

Theorem 4.5. Assume the regularity conditions on the source and distortion as in Theorem 3.3. Assume that $W$ is a DMC with dispersion $V(W) = V_{\min}(W) = V_{\max}(W) > 0$. Then, there exists a sequence of $(k, n, d^{(k)}, \Delta, \varepsilon)$-codes for $P^k$ and $W^n$ if and only if

$$kR(P, \Delta) - nC(W) = \sqrt{kV(P, \Delta) + nV(W)}\,\Phi^{-1}(\varepsilon) + O(\log n). \tag{4.117}$$

Accordingly, by a simple rearrangement, one easily sees that

$$R^*(n, d^{(k)}, \Delta, \varepsilon) = \frac{C(W)}{R(P, \Delta)} + \sqrt{\frac{V(W, P, \Delta)}{n}}\,\Phi^{-1}(\varepsilon) + O\Big(\frac{\log n}{n}\Big), \tag{4.118}$$

where the rate-dispersion function is

$$V(W, P, \Delta) := \frac{R(P, \Delta)V(W) + C(W)V(P, \Delta)}{R(P, \Delta)^3}. \tag{4.119}$$

Figure 4.4: Illustration of the separation scheme for source-channel transmission.

We will not prove this theorem here, as the main ideas, based on new non-asymptotic bounds, have been detailed in previous asymptotic expansions.

The intuition behind the result in Theorem 4.5 is perhaps more important. The non-asymptotic bounds
that are evaluated very roughly say that a joint source-channel coding scheme with probability of excess
distortion no larger than $\varepsilon$ exists if and only if

$$ \Pr( I_n \le J_{k,n} ) \lesssim \varepsilon, \tag{4.120} $$

where the random variables $I_n$ and $J_{k,n}$ are defined as

$$ I_n := \frac{1}{n} \log \frac{W^n(Y^n|\mathbf{x})}{(P^*W)^n(Y^n)}, \quad\text{and}\quad J_{k,n} := \frac{1}{n}\,\jmath(S^k; P^k, \Delta), \tag{4.121} $$

and $\mathbf{x}$ has type $P_{\mathbf{x}} \in \mathscr{P}_n(\mathcal{X})$ close to $\Pi \subset \mathscr{P}(\mathcal{X})$, the set of CAIDs. The bound in (4.120) provides the
intuition that erroneous transmission of the source occurs if and only if the information density random
variable $I_n$ of the channel is not large enough to support the information content of the source, represented
by the $\Delta$-tilted information $J_{k,n}$. We can now estimate the probability in (4.120) by using the central limit
theorem for $k + n$ independent random variables, and the fact that $I_n - J_{k,n}$ has first- and second-order
statistics

$$ \mathbb{E}[I_n - J_{k,n}] = C(W) - \frac{k}{n} R(P,\Delta), \quad\text{and} \tag{4.122} $$

$$ \mathrm{Var}[I_n - J_{k,n}] = \frac{1}{n} V(W) + \frac{k}{n^2} V(P,\Delta). \tag{4.123} $$

This essentially explains the asymptotic expansions in Theorem 4.5.
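The Gaussian heuristic above can be checked numerically. The following is a minimal sketch, assuming the moments in (4.122)-(4.123) and treating $I_n - J_{k,n}$ as exactly Gaussian; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def excess_distortion_clt(k, n, C, V_channel, R, V_source):
    """Gaussian (CLT) estimate of Pr(I_n <= J_{k,n}) from (4.122)-(4.123)."""
    mean = C - (k / n) * R                      # E[I_n - J_{k,n}]
    var = V_channel / n + k * V_source / n**2   # Var[I_n - J_{k,n}]
    # Pr(I_n - J_{k,n} <= 0) for a Gaussian with this mean and variance.
    return NormalDist().cdf(-mean / sqrt(var))


# Hypothetical parameters (in nats): C = 0.5, V(W) = 0.2, R = 0.5, V(P, Delta) = 0.3.
for k in (900, 950, 1000):
    print(k, excess_distortion_clt(k, 1000, 0.5, 0.2, 0.5, 0.3))
```

Setting this estimate equal to $\varepsilon$ and multiplying through by $n$ recovers the relation between $k$ and $n$ in (4.117), up to the $O(\log n)$ term.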


4.5.2 What is the Cost of Separation? 


In showing the seminal result in (4.113), Shannon used a separation scheme. That is, he first considers source
compression to distortion level $\Delta$ using a source encoder $f_s$ and subsequently, information transmission over
the channel $W^n$ using a channel encoder $f_c$. To decode, one simply reverses the process by using a channel decoder
$\varphi_c$ and a source decoder $\varphi_s$. See Fig. 4.4 where $m$ denotes the digital interface. While this idea of separation
has guided the design of communication systems for decades and is first-order optimal in the limit of large
blocklengths, it turns out that such a scheme is neither optimal from the error exponents^ perspective [35]


^To be more precise, the suboptimality of separation in the error exponents regime occurs only when kR{P, A) < nC{W). In 
the other case, by analyzing the probability of no excess distortion, Wang-Ingber-Kochman cn] showed, somewhat surprisingly, 
that separation is optimal. 
nor in the fixed error setting. What is the cost of separation when the error probability is allowed to be
non-vanishing? By combining Theorem 3.3 (for rate-distortion) and Theorems 4.1-4.3 (for channel coding), one
sees that there exists a sequence of $(k, n, d^{(k)}, \Delta, \varepsilon)$-codes for $P^k$ and $W^n$ satisfying

$$ k R(P,\Delta) - n C(W) + O(\log n) \ge \max_{\varepsilon_s + \varepsilon_c \le \varepsilon} \Big\{ \sqrt{k V(P,\Delta)}\,\Phi^{-1}(\varepsilon_s) + \sqrt{n V(W)}\,\Phi^{-1}(\varepsilon_c) \Big\}. \tag{4.124} $$


Inequality (4.124) suggests that we first compress the source up to distortion level $\Delta$ with excess distortion
probability $\varepsilon_s$, then we transmit the resultant bit string over the channel $W^n$ with average error probability
$\varepsilon_c$. In order to have the end-to-end excess distortion probability be no larger than $\varepsilon$, one has to design the
source and channel codes so that $\varepsilon_s + \varepsilon_c \le \varepsilon$.

Because the maximum in (4.124) is no larger than the square root term in (4.117), separation is strictly
sub-optimal in the second-order asymptotic sense (unless either $V(W)$ or $V(P,\Delta)$ vanishes). This is
unsurprising because for the separation scheme, the source and channel error events are treated separately,
while the (approximate) non-asymptotic bound in (4.120) suggests that treating the system jointly results
in better performance in terms of both error and rate.
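To get a feel for the size of the gap, one can compare the second-order terms of (4.117) and (4.124) numerically. The sketch below is illustrative only: the helper names, the grid search over the error split, and the dispersion values are assumptions, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf


def joint_term(k, n, V_source, V_channel, eps):
    """Second-order term of joint coding, the right-hand side of (4.117)."""
    return sqrt(k * V_source + n * V_channel) * PHI_INV(eps)


def separation_term(k, n, V_source, V_channel, eps, grid=2000):
    """Maximum in (4.124), approximated by a grid over splits eps_s + eps_c = eps."""
    best = float("-inf")
    for j in range(1, grid):
        eps_s = eps * j / grid
        eps_c = eps - eps_s
        value = sqrt(k * V_source) * PHI_INV(eps_s) + sqrt(n * V_channel) * PHI_INV(eps_c)
        best = max(best, value)
    return best


# Hypothetical dispersions: V(P, Delta) = 0.3 and V(W) = 0.2, with k = n = 1000.
print(joint_term(1000, 1000, 0.3, 0.2, 0.01))
print(separation_term(1000, 1000, 0.3, 0.2, 0.01))
```

For these values the separation term is markedly more negative than the joint term, which means fewer source symbols can be supported for the same $n$ and $\varepsilon$.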














Part III 

Network Information Theory 
Chapter 5

Channels with Random State 


This chapter departs from a key assumption in usual channel coding (Chapter 4) in which the channel
statistics do not change with time. In many practical communication settings, one may encounter situations
where there is uncertain knowledge of the medium of transmission, or where the medium is changing over
time, such as a wireless channel with fading or memory with stuck-at faults. This situation may be modeled
using a channel whose conditional output probability distribution depends on a state process. Other
prominent applications include digital watermarking and information hiding CHI-. A thorough review of the
(first-order) results in channels with state (or side information) is available in the excellent books by Keshet,
Steinberg and Merhav [51] and El Gamal and Kim [49, Ch. 7].

The state may be known at the encoder only, the decoder only, or at both the encoder and decoder. The
capacity is known in these cases when the state follows an i.i.d. process and the channel is stationary and
memoryless given the state. In this chapter, we review known fixed error probability results for channels
with random state known only at the decoder, channels with random state known at both the encoder and
decoder, Costa’s dirty-paper coding (DPC) problem [30], mixed channels [67, Sec. 3.3] and quasi-static
single-input-multiple-output (SIMO) fading channels. Asymptotic expansions of the logarithm of the maximum
code size are derived for each problem.

We briefly mention some problems we do not treat in this chapter. The second-order asymptotics for
the discrete memoryless Gel’fand-Pinsker [^ problem (where the state is known noncausally at the encoder
only) have not been completely solved [177, 188], so we do not discuss this beyond the Gaussian case (the
DPC problem). We also do not discuss the case where the state is known causally at the encoder. Second-order
asymptotic analysis has also not been performed for this problem, first considered by Shannon [143]
(i.e., Shannon strategies). We leave out channels with non-memoryless state, for example, the Gilbert-Elliott
channel [sni Eoi nm, for which the second-order asymptotics (dispersion) are known [124] under various
scenarios. Finally, our focus here is on channels with a random state. We do not explore channels that
depend on a non-random (but unknown) state. This is also known as the compound channel, and the
asymptotic expansion was derived by Polyanskiy [120].


5.1 Random State at the Decoder 


We warm up with the simple model shown in Fig. 5.1. Here there is a state distribution $P_S \in \mathscr{P}(\mathcal{S})$ on a
finite alphabet $\mathcal{S}$ which generates an i.i.d. random state $S$, i.e., a discrete memoryless source (DMS). The
channel $W$ is a conditional probability distribution from $\mathcal{X} \times \mathcal{S}$ to $\mathcal{Y}$. If the state process is i.i.d. and the
channel is discrete, stationary and memoryless given the state, it is easy to see that the capacity is

$$ C_{\mathrm{SI-D}}(W, P_S) = \max_{P \in \mathscr{P}(\mathcal{X})} I(X; Y | S) = \max_{P \in \mathscr{P}(\mathcal{X})} I(X; YS). \tag{5.1} $$

The idea is to regard $(Y,S)$ as the output of a new channel $\tilde{W}(y,s|x) = P_S(s) W(y|x,s)$, and then to use
Shannon’s result for the capacity of a DMC in (4.1). Analogously to the problems we treated previously, we
define $M^*_{\mathrm{SI-D}}(W^n, P_{S^n}, \varepsilon)$ to be the maximum number of messages transmissible over the DMC $W^n$ with
i.i.d. state $S^n \sim P_{S^n}$ known at the decoder and with average error probability not exceeding $\varepsilon \in (0,1)$. We
also let $W_s(y|x) := W(y|x,s)$ denote the channel indexed by $s \in \mathcal{S}$.

Figure 5.1: Illustration of the state at decoder problem

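The following is due to Ingber and Feder [55].

Theorem 5.1. Assume that $V_\varepsilon(W_s) > 0$ for all $s \in \mathcal{S}$ and $V_\varepsilon(W_s)$ does not depend on $\varepsilon \in (0,1)$^. Then,

$$ \log M^*_{\mathrm{SI-D}}(W^n, P_{S^n}, \varepsilon) = n C_{\mathrm{SI-D}}(W, P_S) + \sqrt{n V_{\mathrm{SI-D}}(W, P_S)}\,\Phi^{-1}(\varepsilon) + O(\log n), \tag{5.2} $$

where the dispersion $V_{\mathrm{SI-D}}(W, P_S)$ is

$$ V_{\mathrm{SI-D}}(W, P_S) = \mathbb{E}_S[V(W_S)] + \mathrm{Var}_S[C(W_S)] \tag{5.3} $$

and where $C(W_s)$ is the capacity of channel $W_s \in \mathscr{P}(\mathcal{Y}|\mathcal{X})$.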

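The proof is based on the fact that we can define a new channel $\tilde{W}$ from $\mathcal{X}$ to $\mathcal{Y} \times \mathcal{S}$ and so letting $X$
be a random variable whose distribution $P_X \in \mathscr{P}(\mathcal{X})$ is a CAID, we have

$$ \begin{aligned}
V_{\mathrm{SI-D}}(W, P_S) &= \mathrm{Var}\left[ \log \frac{\tilde{W}(Y,S|X)}{P_X\tilde{W}(Y,S)} \right] \\
&= \mathrm{Var}\left[ \log \frac{W(Y|X,S)}{P_XW(Y|S)} \right] && (5.4) \\
&= \mathbb{E}_S\left[ \mathrm{Var}\left( \log \frac{W(Y|X,S)}{P_XW(Y|S)} \,\Big|\, S \right) \right] + \mathrm{Var}_S\left[ \mathbb{E}\left( \log \frac{W(Y|X,S)}{P_XW(Y|S)} \,\Big|\, S \right) \right] && (5.5) \\
&= \mathbb{E}_S[V(W_S)] + \mathrm{Var}_S[C(W_S)], && (5.6)
\end{aligned} $$

where (5.5) follows from the law of total variance with the conditional distribution $P_XW(y|s) := \sum_x P_X(x) W(y|x,s)$,
and (5.6) follows from the definition of the capacity and dispersion of $W_s$.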

The dispersion in (5.3) is intuitively pleasing: The term $\mathbb{E}_S[V(W_S)]$ represents the randomness of the
channels $\{W_s : s \in \mathcal{S}\}$ given the state; the term $\mathrm{Var}_S[C(W_S)]$ represents the randomness of the state.
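As a concrete illustration of (5.3), consider a state that selects among binary symmetric channels. This is a sketch under stated assumptions: the standard formulas for the capacity and dispersion of a BSC (in nats) are used, crossover probabilities lie strictly between 0 and 1, and the function names are hypothetical. The uniform input is a CAID for every BSC, so the decomposition applies.

```python
from math import log


def bsc_capacity(p):
    """Capacity of a BSC with crossover probability p, in nats."""
    return log(2) + p * log(p) + (1 - p) * log(1 - p)


def bsc_dispersion(p):
    """Dispersion of a BSC with crossover probability p, in nats squared."""
    return p * (1 - p) * log((1 - p) / p) ** 2


def si_d_dispersion(state_probs, crossovers):
    """Return (E_S[V(W_S)], Var_S[C(W_S)], their sum) as in (5.3)."""
    caps = [bsc_capacity(p) for p in crossovers]
    disps = [bsc_dispersion(p) for p in crossovers]
    mean_disp = sum(w * v for w, v in zip(state_probs, disps))
    mean_cap = sum(w * c for w, c in zip(state_probs, caps))
    var_cap = sum(w * (c - mean_cap) ** 2 for w, c in zip(state_probs, caps))
    return mean_disp, var_cap, mean_disp + var_cap


# Two equally likely states with crossover probabilities 0.05 and 0.2.
print(si_d_dispersion([0.5, 0.5], [0.05, 0.2]))
```

When all states induce the same channel, the second component vanishes and the dispersion reduces to that of a single BSC.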


5.2 Random State at the Encoder and Decoder 


The next model we will study is similar to that in the previous section. However, here the i.i.d. state is
known noncausally at both the encoder and the decoder. See Fig. 5.2. Again, let $W \in \mathscr{P}(\mathcal{Y}|\mathcal{X} \times \mathcal{S})$ be a
state-dependent discrete memoryless channel, stationary and memoryless given the state and let $P_S \in \mathscr{P}(\mathcal{S})$
be a DMS. It is known [49, Sec. 7.4.1] that the capacity of this channel is

$$ C_{\mathrm{SI-ED}}(W, P_S) = \max_{P_{X|S} \in \mathscr{P}(\mathcal{X}|\mathcal{S})} I(X; Y | S). \tag{5.7} $$


^If the CAID of each $W_s$ is unique, $V_\varepsilon(W_s)$ does not depend on $\varepsilon \in (0,1)$.

Figure 5.2: Illustration of the state at encoder and decoder problem


Goldsmith and Varaiya [6T] used time sharing of the state sequence to prove the achievability part of (5.7).
Essentially, their idea is to divide the message into $|\mathcal{S}|$ sub-messages (rate-splitting). Each of these
sub-messages can be sent reliably if and only if its rate is smaller than $I(X;Y|S=s)$ for some $P_{X|S}(\cdot|s)$,
assuming that the state sequence $S^n$ is strongly typical. Averaging $I(X;Y|S=s)$ over $P_S(s)$ proves the
direct part of (5.7). Clearly, if there exists an optimizing distribution $P_{X|S}$ in (5.7) such that $P_{X|S}(\cdot|s)$ does
not depend on $s$, then $C_{\mathrm{SI-ED}}(W, P_S) = C_{\mathrm{SI-D}}(W, P_S)$. For example, if the set of channels $\{W_s : s \in \mathcal{S}\}$
consists of binary symmetric channels with different crossover probabilities, the optimizing $P_{X|S}(\cdot|s)$ is uniform for all $s \in \mathcal{S}$.

In the spirit of this monograph, let $M^*_{\mathrm{SI-ED}}(W^n, P_{S^n}, \varepsilon)$ be the maximum number of messages
transmissible over the channel $W^n$ with i.i.d. random state $S^n \sim P_{S^n}$ known at both encoder and decoder and with
average error probability not exceeding $\varepsilon \in (0,1)$.

The following is due to Tomamichel and Tan [165].

Theorem 5.2. Let $W$ satisfy the assumptions in Theorem 5.1. Then,

$$ \log M^*_{\mathrm{SI-ED}}(W^n, P_{S^n}, \varepsilon) = n C_{\mathrm{SI-ED}}(W, P_S) + \sqrt{n V_{\mathrm{SI-ED}}(W, P_S)}\,\Phi^{-1}(\varepsilon) + O(\log n), \tag{5.8} $$

where the dispersion $V_{\mathrm{SI-ED}}(W, P_S)$ is the expression given in (5.3).


While the appearance of Theorem 5.2 is remarkably similar to that of Theorem 5.1, its justification is
significantly more involved. We will not provide the whole proof here as it is long but only highlight the key
steps in the sketch below. Before we do so, for a sequence $\mathbf{s} \in \mathcal{S}^n$, denote $P_{\mathbf{s}} \in \mathscr{P}_n(\mathcal{S})$ as its type and define

$$ \chi(\mathbf{s}) := \sum_{s \in \mathcal{S}} P_{\mathbf{s}}(s) C(W_s) = \frac{1}{n} \sum_{i=1}^n C(W_{s_i}), \quad\text{and} \tag{5.9} $$

$$ \nu(\mathbf{s}) := \sum_{s \in \mathcal{S}} P_{\mathbf{s}}(s) V(W_s) = \frac{1}{n} \sum_{i=1}^n V(W_{s_i}) \tag{5.10} $$

to be the empirical capacity and the empirical dispersion respectively.


Proof sketch of Theorem 5.2. Suppose first that the state is known to be some deterministic sequence $\mathbf{s} \in \mathcal{S}^n$
of type $P_{\mathbf{s}}$. Denote the optimum error probability for a length-$n$ block code with $M$ codewords as
$\varepsilon^*(W^n, M, \mathbf{s})$. We know by a slight extension of the channel coding result (Theorems 4.1 and 4.3) to
memoryless but non-stationary channels that

$$ \varepsilon^*(W^n, M, \mathbf{s}) = \Phi\left( \frac{\log M - n\chi(\mathbf{s})}{\sqrt{n\nu(\mathbf{s})}} \right) + O\left( \frac{\log n}{\sqrt{n}} \right), \tag{5.11} $$

where the implied constant in the $O(\cdot)$-notation above is uniform over all strongly typical state types $P_{\mathbf{s}}$.
The optimum error probability when the state is random and i.i.d. is denoted as $\varepsilon^*(W^n, M)$ and it can be
written as the following expectation:

$$ \varepsilon^*(W^n, M) = \mathbb{E}_{S^n}\left[ \varepsilon^*(W^n, M, S^n) \right]. \tag{5.12} $$

Therefore, the analysis of the following expectation is crucial:

$$ \mathbb{E}\left[ \Phi\left( \frac{\log M - n\chi(S^n)}{\sqrt{n\nu(S^n)}} \right) \right]. \tag{5.13} $$


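The analysis of (5.13) is facilitated by the following lemmas whose proofs can be found in [165].

Lemma 5.1. The following holds uniformly in $a \in \mathbb{R}$:

$$ \mathbb{E}\left[ \Phi\left( \frac{\sqrt{n}\,(a - \chi(S^n))}{\sqrt{\nu(S^n)}} \right) \right] - \mathbb{E}\left[ \Phi\left( \frac{\sqrt{n}\,(a - \chi(S^n))}{\sqrt{\mathbb{E}_S[V(W_S)]}} \right) \right] = O\left( \frac{\log n}{n} \right). \tag{5.14} $$

This lemma says that we can essentially replace the random quantity $\nu(S^n)$ in (5.13) with the deterministic
quantity $\mathbb{E}_S[V(W_S)]$. The next step involves approximating $\chi(S^n)$ in (5.13) with the true capacity
$C_{\mathrm{SI-ED}}(W, P_S)$.

Lemma 5.2. The following holds uniformly in $a \in \mathbb{R}$:

$$ \mathbb{E}\left[ \Phi\left( \frac{\sqrt{n}\,(a - \chi(S^n))}{\sqrt{\mathbb{E}_S[V(W_S)]}} \right) \right] = \Phi\left( \frac{\sqrt{n}\,(a - C_{\mathrm{SI-ED}}(W, P_S))}{\sqrt{\mathbb{E}_S[V(W_S)] + \mathrm{Var}_S[C(W_S)]}} \right) + O\left( \frac{1}{\sqrt{n}} \right). \tag{5.15} $$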


The idea behind the proof of this lemma is as follows: From (5.9), one can write $\chi(S^n)$ as an average of
i.i.d. random variables $C(W_{S_i})$. The expectation in (5.15) can then be written as

$$ \mathbb{E}\left[ \Phi\left( \frac{\sqrt{n}\,(a - C_{\mathrm{SI-ED}}(W, P_S))}{\sqrt{\mathbb{E}_S[V(W_S)]}} - \sqrt{\frac{\mathrm{Var}_S[C(W_S)]}{\mathbb{E}_S[V(W_S)]}}\, J_n \right) \right], \tag{5.16} $$

where

$$ J_n := \frac{1}{\sqrt{n}} \sum_{i=1}^n E_i, \quad\text{and}\quad E_i := \frac{C(W_{S_i}) - C_{\mathrm{SI-ED}}(W, P_S)}{\sqrt{\mathrm{Var}_S[C(W_S)]}}. \tag{5.17} $$

Clearly, the $E_i$ are zero-mean, unit-variance, i.i.d. random variables and thus $J_n$ converges in distribution to a
standard Gaussian. Now, (5.15) can be established by using the fact that the convolution of two independent
Gaussians is a Gaussian, where the mean and variance are the sums of the constituent means and variances.
Combining Lemmas 5.1 and 5.2 with (5.11)-(5.13) completes the proof. □
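The Gaussian convolution step can be checked numerically: for a standard Gaussian $Z$ and constants $x$ and $c$, one has $\mathbb{E}[\Phi(x - cZ)] = \Phi(x/\sqrt{1+c^2})$, which is the identity that turns (5.16) into the right-hand side of (5.15) once $J_n$ is replaced by its Gaussian limit. The sketch below uses a plain trapezoidal rule; the function names and integration parameters are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def smoothed_cdf(x, c, half_width=10.0, steps=4000):
    """Trapezoidal approximation of E[Phi(x - c*Z)] for Z ~ N(0, 1)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        z = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * STD.pdf(z) * STD.cdf(x - c * z)
    return total * h


def closed_form(x, c):
    """Phi(x / sqrt(1 + c^2)), the Gaussian convolution identity."""
    return STD.cdf(x / sqrt(1.0 + c * c))


print(smoothed_cdf(0.7, 1.3), closed_form(0.7, 1.3))
```

With $x = \sqrt{n}(a - C_{\mathrm{SI-ED}})/\sqrt{\mathbb{E}_S[V(W_S)]}$ and $c^2 = \mathrm{Var}_S[C(W_S)]/\mathbb{E}_S[V(W_S)]$, the closed form has the denominator $\sqrt{\mathbb{E}_S[V(W_S)] + \mathrm{Var}_S[C(W_S)]}$ seen in (5.15).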

Finally, we remark that by appropriate modifications to Lemmas 5.1 and 5.2, Theorem 5.2 can be
generalized to the case where the distribution of the state sequence follows a time-homogeneous and ergodic
Markov chain [165, Thm. 8].


5.3 Writing on Dirty Paper 


Costa’s “writing on dirty paper” result is probably one of the most surprising in network information theory.
It is a special instance of the Gel’fand-Pinsker problem whose setup is shown in Fig. 5.3. In contrast to
the previous two sections, here the state (usually assumed to be i.i.d.) is known noncausally at the encoder.
The capacity of the Gel’fand-Pinsker channel is

$$ C_{\mathrm{SI-E}}(W, P_S) = \max_{P_{U|S},\; x:\, \mathcal{U} \times \mathcal{S} \to \mathcal{X}} I(U;Y) - I(U;S) \tag{5.18} $$

Figure 5.3: Illustration of the Gel’fand-Pinsker problem

where the auxiliary random variable $U$ can be constrained to have cardinality $|\mathcal{U}| \le \min\{|\mathcal{X}||\mathcal{S}|, |\mathcal{Y}| + |\mathcal{S}| + 1\}$.
A strong converse was proved by Tyagi and Narayan [166].

The Gaussian version of the problem, studied by Costa [30], and called writing on dirty paper, is as
follows. The output of the channel $Y$ is the sum of the channel input $X$, a Gaussian state $S \sim \mathcal{N}(0, \mathsf{inr})$ and
independent noise $Z \sim \mathcal{N}(0, 1)$, i.e.,

$$ Y_i = X_i + S_i + Z_i, \quad \forall\, i = 1, \ldots, n. \tag{5.19} $$

As usual, we assume that the codeword power is constrained to not exceed $\mathsf{snr}$, i.e.,

$$ \frac{1}{n} \sum_{i=1}^n X_i^2 \le \mathsf{snr} \tag{5.20} $$

with probability one. If the state is not known at either terminal, then the capacity of the channel is

$$ C_{\mathrm{no-SI}}(W, P_S) = \mathsf{C}\left( \frac{\mathsf{snr}}{1 + \mathsf{inr}} \right). \tag{5.21} $$

If the state is known at both terminals, the decoder can simply subtract it off and the channel behaves like
an AWGN channel with signal-to-noise ratio $\mathsf{snr}$. Thus, the capacity is

$$ C_{\mathrm{SI-ED}}(W, P_S) = \mathsf{C}(\mathsf{snr}). \tag{5.22} $$

Costa showed the surprising result [30] that knowledge of the state is not required at the decoder for the
capacity to be $\mathsf{C}(\mathsf{snr})$! In other words,

$$ C_{\mathrm{SI-E}}(W, P_S) = \mathsf{C}(\mathsf{snr}). \tag{5.23} $$

The natural question, in the spirit of this monograph, is whether there is a degradation to higher-order
terms in the asymptotic expansion of the logarithm of the maximum code size of the channel for a fixed
average error probability (cf. the AWGN case in Theorem 4.4). Scarlett [134] and Jiang-Liu [89] showed the
surprising result that there is no degradation up to the second-order dispersion term! Furthermore, Scarlett
[134] showed that the state sequence only has to satisfy a very mild condition. In particular, it neither has to
be Gaussian nor ergodic. The approach by Jiang-Liu [89] is via lattice coding m-. The proof sketch below
follows Scarlett’s approach in [134].


Theorem 5.3. Assume that there exists some finite $\Gamma > 0$ such that

$$ \Pr\left( \frac{1}{n} \|S^n\|_2^2 > \Gamma \right) = O\left( \frac{\log n}{\sqrt{n}} \right). \tag{5.24} $$

For any $\mathsf{snr} \in (0, \infty)$, the maximum code size for average error probability no larger than $\varepsilon$ satisfies

$$ \log M^*_{\mathrm{SI-E}}(W^n, P_{S^n}, \varepsilon) = n\mathsf{C}(\mathsf{snr}) + \sqrt{n\mathsf{V}(\mathsf{snr})}\,\Phi^{-1}(\varepsilon) + O(\log n). \tag{5.25} $$

The condition in (5.24) is mild. For example if $S^n$ is a zero-mean, i.i.d. process and $\Gamma$ is chosen to be
larger than $\mathbb{E}[S_1^2]$, under the condition that $\mathbb{E}[S_1^4] < \infty$, the probability decays at least as fast as $O(\frac{1}{n})$ by
Chebyshev’s inequality, thus satisfying (5.24).
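The Chebyshev argument can be made explicit with a few lines of code. This is a minimal sketch, assuming an i.i.d. zero-mean state with finite fourth moment; the function name and the numerical values are hypothetical. Since the $S_i^2$ are i.i.d. with variance $\mathbb{E}[S_1^4] - \mathbb{E}[S_1^2]^2$, the empirical power has variance equal to this quantity divided by $n$.

```python
def chebyshev_power_bound(n, second_moment, fourth_moment, gamma):
    """Chebyshev upper bound on Pr((1/n) * ||S^n||^2 > gamma) for i.i.d. zero-mean S_i."""
    if gamma <= second_moment:
        raise ValueError("gamma must exceed the second moment E[S^2]")
    variance_of_square = fourth_moment - second_moment ** 2
    return min(1.0, variance_of_square / (n * (gamma - second_moment) ** 2))


# Standard Gaussian state: E[S^2] = 1 and E[S^4] = 3, with threshold gamma = 2.
for n in (100, 1000, 10000):
    print(n, chebyshev_power_bound(n, 1.0, 3.0, 2.0))
```

The bound scales as $1/n$, which is well within the $O(\log n/\sqrt{n})$ decay required by (5.24).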

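Before we sketch the proof of Theorem 5.3, let us recap Costa’s proof of the DPC capacity in (5.23). He
assumes $S$ is Gaussian with some variance $\mathsf{inr}$ and chooses $U = X + \alpha S$, where $X \sim \mathcal{N}(0, \mathsf{snr})$ and $S$ are
independent. He then performs calculations which yield

$$ I(U;Y) = \frac{1}{2} \log\left( \frac{(\mathsf{snr} + \mathsf{inr} + 1)(\mathsf{snr} + \alpha^2 \mathsf{inr})}{\mathsf{snr} \cdot \mathsf{inr}(1-\alpha)^2 + (\mathsf{snr} + \alpha^2 \mathsf{inr})} \right), \tag{5.26} $$

$$ I(U;S) = \frac{1}{2} \log\left( \frac{\mathsf{snr} + \alpha^2 \mathsf{inr}}{\mathsf{snr}} \right), \quad\text{and} \tag{5.27} $$

$$ I(U;Y) - I(U;S) = \frac{1}{2} \log\left( \frac{\mathsf{snr}(\mathsf{snr} + \mathsf{inr} + 1)}{\mathsf{snr} \cdot \mathsf{inr}(1-\alpha)^2 + (\mathsf{snr} + \alpha^2 \mathsf{inr})} \right). \tag{5.28} $$

Differentiating the final expression (5.28) with respect to $\alpha$ and setting it to zero shows that $\alpha^* = \frac{\mathsf{snr}}{\mathsf{snr}+1}$,
independent of $\mathsf{inr}$. Furthermore the expression (5.28) evaluated at $\alpha^*$ yields $\mathsf{C}(\mathsf{snr})$ which is, of course, also
independent of $\mathsf{inr}$. So the important thing to note here is that $I(U;Y) - I(U;S)$ is independent of $\mathsf{inr}$ at
the optimum $\alpha$, which is the weight of the minimum mean squared error estimate of $X$ given $X + Z$.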
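Costa’s calculation is easy to verify numerically. The sketch below evaluates (5.28) and searches a grid of $\alpha$ values; the function names and the test values of $\mathsf{snr}$ and $\mathsf{inr}$ are illustrative assumptions, and rates are in nats.

```python
from math import log


def dpc_rate(alpha, snr, inr):
    """I(U;Y) - I(U;S) from (5.28) for U = X + alpha*S, in nats."""
    numerator = snr * (snr + inr + 1.0)
    denominator = snr * inr * (1.0 - alpha) ** 2 + (snr + alpha ** 2 * inr)
    return 0.5 * log(numerator / denominator)


def awgn_capacity(snr):
    """Gaussian capacity function C(snr) = 0.5*log(1 + snr), in nats."""
    return 0.5 * log(1.0 + snr)


def best_alpha_on_grid(snr, inr, grid=10001):
    """Grid search over alpha in [0, 1] for the maximizer of (5.28)."""
    alphas = [j / (grid - 1) for j in range(grid)]
    return max(alphas, key=lambda a: dpc_rate(a, snr, inr))


snr = 2.0
for inr in (0.5, 5.0, 50.0):
    a = best_alpha_on_grid(snr, inr)
    print(inr, a, dpc_rate(a, snr, inr), awgn_capacity(snr))
```

For each interference level the maximizer sits near $\mathsf{snr}/(\mathsf{snr}+1) = 2/3$ and the maximal rate matches $\mathsf{C}(\mathsf{snr})$.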


Proof sketch of Theorem 5.3. The main ideas of the proof are sketched here. The converse follows from
Theorem 4.4 so we only have to prove achievability. We start with some preliminary definitions.

The analogue of types of states which take values in Euclidean space and which we find helpful here is
the notion of power types [109]. Fix $\zeta > 0$ and consider the power type class

$$ \mathcal{T}_n(\tau) := \left\{ \mathbf{s} \in \mathbb{R}^n : \tau \le \frac{1}{n} \|\mathbf{s}\|_2^2 < \tau + \frac{\zeta}{n} \right\} \tag{5.29} $$

where $\tau$ is an integer multiple of $\zeta/n$. Intuitively, what we are doing is partitioning $[0, \infty)$ into small intervals, each of length $\zeta/n$.
For any sequence $\mathbf{s} \in \mathcal{T}_n(\tau)$, we say that its power type is $\tau$, i.e., $n\tau \le \|\mathbf{s}\|_2^2 < n\tau + \zeta$. Thus, the normalized
square of the $\ell_2$-norm, quantized to the left endpoint of the interval $[\tau, \tau + \frac{\zeta}{n})$, is the power type of $\mathbf{s}$. The
set of all power types is denoted as $\mathcal{P}_n \subset [0, \infty)$.

Consider the following typical set of power types (also called typical types)

$$ \tilde{\mathcal{P}}_n := \mathcal{P}_n \cap [0, \Gamma]. \tag{5.30} $$

Thus, we are simply truncating those power types $\tau$ that are larger than $\Gamma$, the threshold in the statement of
the theorem. Clearly, the size of the typical set of power types $|\tilde{\mathcal{P}}_n| = \lfloor \Gamma n/\zeta \rfloor = O(n)$, which is polynomial
in $n$. This is similar to the discrete case [HI Ch. 2]. Furthermore, by the assumption in (5.24),

$$ \Pr\left( P_{S^n} \notin \tilde{\mathcal{P}}_n \right) = O\left( \frac{\log n}{\sqrt{n}} \right). \tag{5.31} $$
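The quantization that defines a power type is a one-line computation. The following sketch assumes the bin width is written as `zeta` (so each interval has length `zeta / n`); the function names are hypothetical.

```python
from math import floor


def power_type(s, zeta):
    """Left endpoint tau of the interval [tau, tau + zeta/n) containing ||s||^2 / n."""
    n = len(s)
    energy = sum(x * x for x in s)
    # n*tau is the largest multiple of zeta not exceeding the energy ||s||^2.
    return floor(energy / zeta) * zeta / n


def num_typical_power_types(n, zeta, gamma):
    """Size estimate floor(gamma*n/zeta) of the typical set, which is linear in n."""
    return floor(gamma * n / zeta)


s = [1.0, 2.0, 2.1]
tau = power_type(s, 0.5)
print(tau, sum(x * x for x in s) / len(s), num_typical_power_types(len(s), 0.5, 4.0))
```

Because the number of typical power types grows only linearly in $n$, describing the type to the decoder costs $O(\log n)$ symbols.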


We use the first $O(\log n)$ symbols to transmit $P_{S^n}$, the state type. The rest of the $n - O(\log n)$ symbols are
used to transmit the message. By using the theory of error exponents for the Gel’fand-Pinsker problem [115]
and the fact that the number of state types is polynomial, one can show that $P_{S^n}$ can be decoded with error
probability $O(\frac{1}{\sqrt{n}})$. The $O(\log n)$ symbols used to transmit the state type do not affect the dispersion
term. In the following, with a slight abuse of notation, $n$ refers to the remaining channel uses.

The decoder uses information density thresholding with respect to the joint distribution

$$ P^{(\tau)}_{SUY}(s,u,y) := P^{(\tau)}_S(s)\, P_{U|S}(u|s)\, P_{Y|SU}(y|s,u) \tag{5.32} $$

where $\tau$ indexes a power type, the state distribution is $P^{(\tau)}_S = \mathcal{N}(0, \tau)$, the conditional distributions are
$P_{U|S}(\cdot|s) = \mathcal{N}(\alpha s, \mathsf{snr})$ and $P_{Y|SU}(\cdot|s,u) = \mathcal{N}(u + (1-\alpha)s, 1)$. The corresponding mutual informations
induced by the joint distribution in (5.32) are denoted as $I^{(\tau)}(U;S)$ and $I^{(\tau)}(U;Y)$. The constant $\alpha > 0$ is
arbitrary for now.

With these preparations, we are ready to prove Theorem 5.3 and we divide the proof into several steps.

Step 1 (Codebook Generation): The number of auxiliary codewords for each type $\tau \in \tilde{\mathcal{P}}_n$ is denoted as
$L^{(\tau)}$. For each state type $\tau \in \tilde{\mathcal{P}}_n$ and each message $m \in \{1, \ldots, M\}$, generate a type-dependent codebook
$\mathcal{C}^{(\tau)}$ consisting of codewords $\{U^n(m,l) : m \in \{1, \ldots, M\}, l \in \{1, \ldots, L^{(\tau)}\}\}$ where each codeword is drawn
independently from

$$ P^{(\tau)}_{U^n}(\mathbf{u}) := \frac{\delta\{ \|\mathbf{u}\|_2^2 - n(\mathsf{snr} + \alpha^2 \tau) \}}{A_n\big( \sqrt{n(\mathsf{snr} + \alpha^2 \tau)} \big)}. \tag{5.33} $$

That is, similar to the proof of Theorem 4.4 we uniformly generate codewords $U^n(m,l)$ from a sphere in $\mathbb{R}^n$
with radius depending on the type, namely $\sqrt{n(\mathsf{snr} + \alpha^2 \tau)}$.

Step 2 (Encoding): Given the state sequence $S^n$ and message $m$, the encoder first calculates the type of
$S^n$, denoted as $\tau$. If $\tau$ is not typical in the sense of (5.31), declare an error. The contribution to the overall
error probability is given in (5.31), which is easily seen to not affect the second-order term in (5.25). If $\tau$
is typical, the encoder then proceeds to find an index $l \in \{1, \ldots, L^{(\tau)}\}$ such that $U^n(m,l)$ is typical in the
sense that

$$ \| U^n(m,l) - \alpha S^n \|_2^2 \in [n\,\mathsf{snr} - \eta,\; n\,\mathsf{snr}], \tag{5.34} $$

where $\eta > 0$ is chosen to be a small constant. If there are multiple such $l$, choose one with the smallest
index. If there is none, declare an encoding error. The encoder transmits $X^n := U^n(m,l) - \alpha S^n$. Clearly
the power constraint on $X^n$ in (5.20) is satisfied with probability one.

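Step 3 (Decoding): Given the channel output $\mathbf{y}$ and the state type $\tau$, the decoder looks for a codeword
$\mathbf{u}(\hat{m}, \hat{l}) \in \mathcal{C}^{(\tau)}$ such that

$$ g^{(\tau)}(\mathbf{u}(\hat{m}, \hat{l}), \mathbf{y}) := \sum_{i=1}^n \log \frac{P^{(\tau)}_{Y|U}(y_i | u_i(\hat{m}, \hat{l}))}{P^{(\tau)}_Y(y_i)} \ge \gamma^{(\tau)} \tag{5.35} $$

where $\gamma^{(\tau)}$ is a power type-dependent threshold to be chosen in the following. The distribution $P^{(\tau)}_{Y|U}$ is
defined according to (5.32) and $g^{(\tau)}$ is simply an information density indexed by the power type $\tau$.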

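Step 4 (Analysis of Error Probability): Assume $m = 1$. Let $\tau$ be the power type of the state $S^n$. Let $l$
be the chosen index in the encoder step. Clearly, the error event is the union of the following two events:

$$ \mathcal{E}_c := \left\{ \forall\, l \in \{1, \ldots, L^{(\tau)}\} : \| U^n(1,l) - \alpha S^n \|_2^2 \notin [n\,\mathsf{snr} - \eta,\; n\,\mathsf{snr}] \right\} \tag{5.36} $$

$$ \mathcal{E}_p := \left\{ \text{Decoder estimates an } \hat{m} \ne 1 \right\} \tag{5.37} $$

If we set the number of auxiliary codewords for type class indexed by $\tau$ to be

$$ \log L^{(\tau)} := n I^{(\tau)}(U;S) + \kappa_1 \log n, \tag{5.38} $$

for some $\kappa_1 > 0$, then by techniques similar to the covering lemma [49], we can show that

$$ \Pr( \mathcal{E}_c \mid P_{S^n} = \tau ) \le \exp(-\psi n) \tag{5.39} $$

for some $\psi > 0$ and all typical types $\tau \in \tilde{\mathcal{P}}_n$. The event $\mathcal{E}_p$ can be analyzed per Feinstein-style [53] threshold
decoding as follows:

$$ \Pr( \mathcal{E}_p \mid \mathcal{E}_c^c, P_{S^n} = \tau ) \le \Pr\big( g^{(\tau)}(U^n, Y^n) \le \gamma^{(\tau)} \mid \mathcal{E}_c^c, P_{S^n} = \tau \big) + M L^{(\tau)} \Pr\big( g^{(\tau)}(\bar{U}^n, Y^n) > \gamma^{(\tau)} \mid \mathcal{E}_c^c, P_{S^n} = \tau \big) \tag{5.40} $$

where $\bar{U}^n \sim P^{(\tau)}_{U^n}$ is independent of $Y^n$. By a change-of-measure argument similar to (4.108)-(4.109) for the
AWGN case, one can show that if $\gamma^{(\tau)}$ is chosen to be

$$ \gamma^{(\tau)} := \log M + n I^{(\tau)}(U;S) + \kappa_2 \log n \tag{5.41} $$

where $\kappa_2 := \kappa_1 + 1$, then the second term in (5.40) decays as $O(\frac{1}{\sqrt{n}})$. So it remains to analyze the first term.
We do so using the Berry-Esseen theorem and the fact that with $\alpha^* = \frac{\mathsf{snr}}{1+\mathsf{snr}}$, for any $\tau \in \tilde{\mathcal{P}}_n$ and any $\mathbf{s}$ and
$\mathbf{u}$ in the support of $P_{S^n, U^n, Y^n}$ conditioned on $\mathcal{E}_c^c$ and $P_{S^n} = \tau$,

$$ \mathbb{E}\big[ g^{(\tau)}(U^n, Y^n) \mid S^n = \mathbf{s}, U^n = \mathbf{u} \big] = n I^{(\tau)}(U;Y) + O(1), \quad\text{and} \tag{5.42} $$

$$ \mathrm{Var}\big[ g^{(\tau)}(U^n, Y^n) \mid S^n = \mathbf{s}, U^n = \mathbf{u} \big] = n \mathsf{V}(\mathsf{snr}) + O(1). \tag{5.43} $$

The proof is completed by noting that for $\alpha^* = \frac{\mathsf{snr}}{1+\mathsf{snr}}$, the difference of mutual informations
$I^{(\tau)}(U;Y) - I^{(\tau)}(U;S)$ equals $\mathsf{C}(\mathsf{snr})$ for every power type $\tau$ (in fact every variance) as we discussed prior to the start of
this proof. □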


5.4 Mixed Channels 

In this section, we consider state-dependent DMCs $W \in \mathscr{P}(\mathcal{Y}|\mathcal{X} \times \mathcal{S})$ where the state sequence is random
but fixed throughout the entire transmission block once it is determined at the start. This class of channels
is known as mixed channels [67, Sec. 3.3]. The precise setup is as follows. Let $S$ be a state random variable
with a binary alphabet $\mathcal{S} = \{0,1\}$ and let $P_S$ be its distribution. We consider two DMCs, each indexed by
a state $s \in \mathcal{S}$. These DMCs are denoted as $W_0 := W(\cdot|\cdot,0)$ and $W_1 := W(\cdot|\cdot,1)$ and have capacities $C(W_0)$
and $C(W_1)$ respectively. Without loss of generality, we assume that $C(W_0) \le C(W_1)$. We also assume that
each of these channels has a unique CAID and the CAIDs coincide.^ Their $\varepsilon$-channel dispersions (cf. (4.27)-(4.28))
are denoted by $V(W_0)$ and $V(W_1)$ respectively. The $\varepsilon$-dispersions are assumed to be positive and
are independent of $\varepsilon$ because the CAIDs are unique.

Before transmission begins, the entire state sequence $S^n = (S, \ldots, S) \in \{0,1\}^n$ is determined. Note that
the probability that the DMC is $W_s$ is $\pi_s := P_S(s)$. The realization of the state is known to neither the
encoder nor the decoder. The probability of observing the sequence $\mathbf{y} \in \mathcal{Y}^n$ given an input sequence $\mathbf{x} \in \mathcal{X}^n$
is

$$ \Pr( Y^n = \mathbf{y} \mid X^n = \mathbf{x} ) = \sum_{s \in \mathcal{S}} \pi_s W_s^n(\mathbf{y}|\mathbf{x}) =: W^{(n)}_{\mathrm{mix}}(\mathbf{y}|\mathbf{x}), \tag{5.44} $$

explaining the term mixed channels. We let $M^*_{\mathrm{mix}}(W^n, P_S, \varepsilon)$ denote the maximum number of messages that
can be transmitted through the channel $W^n$ when the state distribution is $P_S$ and if the tolerable average
error probability is $\varepsilon \in (0,1)$.

The class of mixed channels is the prototypical one in which the strong converse property [67, Sec. 3.5]
does not hold in general. This means that the $\varepsilon$-capacity

$$ C_\varepsilon(W, P_S) := \liminf_{n \to \infty} \frac{1}{n} \log M^*_{\mathrm{mix}}(W^n, P_S, \varepsilon) \tag{5.45} $$

depends on $\varepsilon$ in general. To state $C_\varepsilon(W, P_S)$ for binary state distributions, we consider three different cases:

Case (i): $C(W_0) = C(W_1)$ and relative magnitudes of $\varepsilon$ and $\pi_0$ are arbitrary
Case (ii): $C(W_0) < C(W_1)$ and $\varepsilon < \pi_0$
Case (iii): $C(W_0) < C(W_1)$ and $\varepsilon \ge \pi_0$

It is known [67, Sec. 3.3] that

$$ C_\varepsilon(W, P_S) = \begin{cases} C(W_0) = C(W_1) & \text{Case (i)} \\ C(W_0) & \text{Case (ii)} \\ C(W_1) & \text{Case (iii)} \end{cases} \tag{5.46} $$

A plot of the $\varepsilon$-capacity is provided in Fig. 5.4.

^An example of this would be two binary symmetric channels. Both the CAIDs are uniform distributions on {0,1} and they 
are clearly unique. 
Figure 5.4: Plot of the $\varepsilon$-capacity $C_\varepsilon(W, P_S)$ against $\varepsilon$ for the case $C(W_0) < C(W_1)$. The strong converse property [67,
Sec. 3.5] holds iff $C(W_0) = C(W_1)$ in which case $C_\varepsilon(W, P_S)$ does not depend on $\varepsilon$.


The following theorem was proved for the special case of Gilbert-Elliott channels [Sni Uni Ell by
Polyanskiy-Poor-Verdú [124, Thm. 7] where $W_0$ and $W_1$ are binary symmetric channels so their CAIDs
are uniform on $\mathcal{X}$. The coefficient $L(\varepsilon; W, P_S) \in \mathbb{R}$ in the asymptotic expansion

$$ \log M^*_{\mathrm{mix}}(W^n, P_S, \varepsilon) = n C_\varepsilon(W, P_S) + \sqrt{n}\, L(\varepsilon; W, P_S) + o(\sqrt{n}), \tag{5.47} $$

was sought. This coefficient is termed the second-order coding rate. In the following theorem, we state and
prove a more general version of the result by Polyanskiy-Poor-Verdú [124, Thm. 7]. For a result imposing
even less restrictive assumptions, we refer the reader to the work by Yagi and Nomura [185].

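Theorem 5.4. Assume that each channel $W_s$, $s \in \mathcal{S}$ has a unique CAID and the CAIDs coincide. In the
various cases above, the second-order coding rate is given as follows:

Case (i): $L(\varepsilon; W, P_S)$ is the solution $l$ to the following equation:

$$ \pi_0\, \Phi\left( \frac{l}{\sqrt{V(W_0)}} \right) + \pi_1\, \Phi\left( \frac{l}{\sqrt{V(W_1)}} \right) = \varepsilon. \tag{5.48} $$

Case (ii):

$$ L(\varepsilon; W, P_S) = \sqrt{V(W_0)}\, \Phi^{-1}\left( \frac{\varepsilon}{\pi_0} \right). \tag{5.49} $$

Case (iii):

$$ L(\varepsilon; W, P_S) = \sqrt{V(W_1)}\, \Phi^{-1}\left( \frac{\varepsilon - \pi_0}{\pi_1} \right). \tag{5.50} $$

If $\varepsilon = \pi_0$, then $L(\varepsilon; W, P_S) = -\infty$.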

We observe that in Case (i) where the capacities $C(W_0)$ and $C(W_1)$ coincide (but not necessarily the
dispersions), the second-order coding rate is a function of both the dispersions $V(W_0)$ and $V(W_1)$, together
with $\pi_0$ and $\varepsilon$. This function also involves two Gaussian cdfs, suggesting, in the proof, that we apply the
central limit theorem twice. In the case where one capacity is strictly smaller than another (Cases (ii) and
(iii)), there is only one Gaussian cdf, which means that one of the two channels dominates the overall system
behavior. Intuitively for Case (ii), the first-order term is $C(W_0) < C(W_1)$ and $\varepsilon < \pi_0$, so the channel with
the smaller capacity dominates the asymptotic behavior of the channel, resulting in the second-order term
being solely dependent on $V(W_0)$. In Case (iii), since $\varepsilon \ge \pi_0$, we can tolerate a higher error probability so
the channel with the larger capacity dominates the asymptotic behavior. Hence, $L(\varepsilon; W, P_S)$ depends only
on $V(W_1)$.
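The three cases of Theorem 5.4 translate directly into a small routine. The following is a sketch under stated assumptions: the function name is hypothetical, the inputs satisfy $0 < \varepsilon < 1$, $0 < \pi_0 < 1$ and $C(W_0) \le C(W_1)$, and Case (i) is solved by bisection, which works because the left-hand side of (5.48) is increasing in $l$.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def second_order_rate(eps, pi0, C0, C1, V0, V1):
    """Second-order coding rate L(eps; W, P_S) of Theorem 5.4 for a binary mixed channel."""
    pi1 = 1.0 - pi0
    if C0 == C1:
        # Case (i): solve pi0*Phi(l/sqrt(V0)) + pi1*Phi(l/sqrt(V1)) = eps for l.
        def gap(l):
            return pi0 * STD.cdf(l / sqrt(V0)) + pi1 * STD.cdf(l / sqrt(V1)) - eps

        lo, hi = -1.0, 1.0
        while gap(lo) > 0:   # widen the bracket to the left if needed
            lo *= 2.0
        while gap(hi) < 0:   # widen the bracket to the right if needed
            hi *= 2.0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if gap(mid) > 0:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)
    if eps < pi0:
        # Case (ii): the channel with the smaller capacity dominates.
        return sqrt(V0) * STD.inv_cdf(eps / pi0)
    if eps == pi0:
        return float("-inf")
    # Case (iii): the channel with the larger capacity dominates.
    return sqrt(V1) * STD.inv_cdf((eps - pi0) / pi1)


print(second_order_rate(0.1, 0.3, 0.5, 0.5, 0.2, 0.6))  # Case (i)
print(second_order_rate(0.1, 0.3, 0.4, 0.5, 0.2, 0.6))  # Case (ii)
print(second_order_rate(0.5, 0.3, 0.4, 0.5, 0.2, 0.6))  # Case (iii)
```

When the two dispersions are equal in Case (i), the routine returns $\sqrt{V}\,\Phi^{-1}(\varepsilon)$, the familiar second-order term for a single DMC.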

The corresponding results for source coding, random number generation and Slepian-Wolf coding were
derived by Nomura-Han [117, 118]. We only provide a proof sketch of Case (i) in Theorem 5.4 here.
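Proof sketch of Case (i) in Theorem 5.4. For the direct part of Case (i), we specialize Feinstein’s theorem
(Proposition 4.1) with the input distribution chosen to be the $n$-fold product of the common CAID of $W_0$
and $W_1$, denoted as $P \in \mathscr{P}(\mathcal{X})$. Recall the definition of $W^{(n)}_{\mathrm{mix}}(\mathbf{y}|\mathbf{x})$ in (5.44). By the law of total probability,
the probability defining the $(\varepsilon - \eta)$-information spectrum divergence simplifies as follows:

$$ p := \Pr\left( \log \frac{W^{(n)}_{\mathrm{mix}}(Y^n|X^n)}{P^nW^{(n)}_{\mathrm{mix}}(Y^n)} \le R \right) \tag{5.51} $$

$$ \phantom{p :}= \sum_{s \in \mathcal{S}} \pi_s \Pr\left( \log \frac{W^{(n)}_{\mathrm{mix}}(Y_s^n|X^n)}{P^nW^{(n)}_{\mathrm{mix}}(Y_s^n)} \le R \right) \tag{5.52} $$

where $Y_s^n$, $s \in \mathcal{S}$ denotes the output of $W_s^n$ when the input is $X^n$. Fix $\gamma > 0$. Consider the probability
indexed by $s = 0$ in (5.52):

$$ p_0 := \Pr\left( \log \frac{W^{(n)}_{\mathrm{mix}}(Y_0^n|X^n)}{P^nW^{(n)}_{\mathrm{mix}}(Y_0^n)} \le R \right) \tag{5.53} $$

$$ \phantom{p_0 :}\le \Pr\left( \log \frac{W_0^n(Y_0^n|X^n)}{(PW_0)^n(Y_0^n)} + \log \frac{(PW_0)^n(Y_0^n)}{P^nW^{(n)}_{\mathrm{mix}}(Y_0^n)} \le R,\; Y_0^n \in \mathcal{A}_\gamma \right) + \Pr\left( Y_0^n \in \mathcal{A}_\gamma^c \right) \tag{5.54} $$

where the set

$$ \mathcal{A}_\gamma := \left\{ \mathbf{y} \in \mathcal{Y}^n : \log \frac{(PW_0)^n(\mathbf{y})}{P^nW^{(n)}_{\mathrm{mix}}(\mathbf{y})} \ge -\gamma \right\}. \tag{5.55} $$

Because $Y_0^n \sim (PW_0)^n$, we have $\Pr(Y_0^n \in \mathcal{A}_\gamma^c) \le \exp(-\gamma)$. This, together with the definition of $\mathcal{A}_\gamma$, implies that

$$ p_0 \le \Pr\left( \log \frac{W_0^n(Y_0^n|X^n)}{(PW_0)^n(Y_0^n)} \le R + \gamma \right) + \exp(-\gamma) \tag{5.56} $$

$$ \phantom{p_0}\le \Phi\left( \frac{R + \gamma - nC(W_0)}{\sqrt{nV(W_0)}} \right) + O\left( \frac{1}{\sqrt{n}} \right) + \exp(-\gamma), \tag{5.57} $$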

where the final step follows from the i.i.d. version of the Berry-Esseen theorem. The same
technique can be used to upper bound the second probability in (5.52). Choosing $\eta = \frac{1}{\sqrt{n}}$ and $\gamma = \frac{1}{2} \log n$
results in

$$ p \le \sum_{s \in \mathcal{S}} \pi_s\, \Phi\left( \frac{R + \frac{1}{2} \log n - nC(W_s)}{\sqrt{nV(W_s)}} \right) + O\left( \frac{1}{\sqrt{n}} \right). \tag{5.58} $$

Now we substitute this bound on $p$ into the definition of the $(\varepsilon - \eta)$-information spectrum divergence in Feinstein’s
theorem. We note that $C(W_0) = C(W_1)$ and thus may solve for a lower bound on $R$. This then completes
the direct part of Case (i) in (5.48). Notice that for Case (ii), all the derivations up to (5.58) hold verbatim.
However, note that since $C_\varepsilon(W, P_S) = C(W_0)$, we have that $R = nC(W_0) + l\sqrt{n} + o(\sqrt{n})$ for some $l \in \mathbb{R}$.
By virtue of the fact that $C(W_0) < C(W_1)$, the second term in (5.58) vanishes asymptotically and we
recover (5.49) which involves only one Gaussian cdf.


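For the converse part of Case (i), we appeal to the symbol-wise converse bound (Proposition 4.4). For
a fixed $\mathbf{x} \in \mathcal{X}^n$ and arbitrary output distribution $Q^{(n)} \in \mathscr{P}(\mathcal{Y}^n)$, the probability that defines the
$(\varepsilon + \eta)$-information spectrum divergence can be written as

$$ q := \Pr\left( \log \frac{W^{(n)}_{\mathrm{mix}}(Y^n|\mathbf{x})}{Q^{(n)}(Y^n)} \le R \right) \tag{5.59} $$

$$ \phantom{q :}= \sum_{s \in \mathcal{S}} \pi_s \Pr\left( \log \frac{W^{(n)}_{\mathrm{mix}}(Y_s^n|\mathbf{x})}{Q^{(n)}(Y_s^n)} \le R \right) \tag{5.60} $$

where $Y_s^n$, $s \in \mathcal{S}$ is the output of $W_s^n$ given input $\mathbf{x}$. Now choose the output distribution to be

$$ Q^{(n)}(\mathbf{y}) := \frac{1}{2} \sum_{s \in \mathcal{S}} Q_s^{(n)}(\mathbf{y}), \tag{5.61} $$

where for each $s \in \mathcal{S}$,

$$ Q_s^{(n)}(\mathbf{y}) := \frac{1}{|\mathscr{P}_n(\mathcal{X})|} \sum_{P_{\mathbf{x}} \in \mathscr{P}_n(\mathcal{X})} \prod_{i=1}^n P_{\mathbf{x}}W_s(y_i). \tag{5.62} $$

Now note that

$$ Q^{(n)}(\mathbf{y}) \ge \frac{1}{2 |\mathscr{P}_n(\mathcal{X})|} \prod_{i=1}^n P_{\mathbf{x}}W_s(y_i) \tag{5.63} $$

for any $s \in \mathcal{S}$ and type $P_{\mathbf{x}} \in \mathscr{P}_n(\mathcal{X})$. By sifting out the type corresponding to $\mathbf{x}$ for channel $W_0$, the
probability in (5.60) corresponding to $s = 0$ can be lower bounded as

$$ q_0 \ge \Pr\left( \log \frac{W_0^n(Y_0^n|\mathbf{x})}{(P_{\mathbf{x}}W_0)^n(Y_0^n)} \le R - \log\big( 2 |\mathscr{P}_n(\mathcal{X})| \big) \right). \tag{5.64} $$

By separately considering types close to (Berry-Esseen) and far away (Chebyshev) from the CAID similarly
to the proof of Theorem 4.3 (or [T^ Thm. 3]), we can show that (5.64) simplifies to

$$ q_0 \ge \Phi\left( \frac{R - nC(W_0)}{\sqrt{nV(W_0)}} \right) + O\left( \frac{\log n}{\sqrt{n}} \right), \tag{5.65} $$

uniformly for all $\mathbf{x} \in \mathcal{X}^n$. The same calculation holds for the second probability in (5.60). By choosing
$\eta = \frac{1}{\sqrt{n}}$ we can upper bound $R$ using Proposition 4.4 and the converse proof of Case (i) can be completed. □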


We observe that the crux of the above proof is to use the law of total probability to write the probabilities
in the information spectrum divergences as convex combinations of constituent probabilities involving
non-mixed channels. For the direct part, a change-of-output-measure by conditioning on the event $Y_0^n \in \mathcal{A}_\gamma$ in
(5.54) is required. For the converse part, the proof proceeds in a manner similar to the converse proof for
the second-order asymptotics for DMCs, upon choosing the auxiliary output measure appropriately.


5.5 Quasi-Static Fading Channels 


The final channel with state we consider in this chapter is the quasi-static single-input-multiple-output 
(SIMO) channel with r receive antennas. The term quasi-static means that the channel statistics (fad¬ 
ing coefficients) remain constant during the transmission of each codeword, similarly to mixed channels. 
Yang-Durisi-Koch-Polyanskiy [186] derived asymptotic expansions for this channel model which is described 
precisely as follows: For time i = 1,... ,n, the channel law is given as 


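$$\begin{bmatrix} Y_{i1} \\ \vdots \\ Y_{ir} \end{bmatrix} = \begin{bmatrix} H_1 \\ \vdots \\ H_r \end{bmatrix} X_i + \begin{bmatrix} Z_{i1} \\ \vdots \\ Z_{ir} \end{bmatrix} \tag{5.66}$$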




where $H^r := (H_1, \ldots, H_r)'$ is the vector of (real-valued) i.i.d. fading coefficients, which are random but
remain constant for all channel uses, and $\{Z_{ij}\}$ are i.i.d. noises distributed as $\mathcal{N}(0,1)$. In the theory of
fading channels m, the channel inputs and outputs are usually complex-valued, but to illustrate the key 
ideas, it is sufficient to consider real-valued channels and fading coefficients. In this section, we restrict our 
attention to the real-valued SIMO model in (5.66). The channel input $X^n$ must satisfy

$$\sum_{i=1}^n X_i^2 \le n \, \mathrm{snr} \tag{5.67}$$

with probability one for some permissible power $\mathrm{snr} > 0$.

Two different setups are considered. In the first, neither the encoder nor the decoder has information about
the realization of $H^r$. In the second, both the encoder and decoder have this information.

For a given distribution on the fading coefficients $P_{H^r}$ (this plays the role of the state or side information),
define $M^*_{\mathrm{no-SI}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon)$ and $M^*_{\mathrm{SI-ED}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon)$ to be the maximum number of codewords
transmissible over $n$ independent uses of the channel under constraint (5.67), with fading distribution $P_{H^r}$,
and with average error probability not exceeding $\varepsilon$ under the no side information and complete knowledge
of side information settings respectively. It is known using the theory of general channels [169, Thm. 6] that
for every $\varepsilon \in (0,1)$, the following limits exist and are equal:

$$\lim_{n\to\infty} \frac{1}{n} \log M^*_{\mathrm{no-SI}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon) = \lim_{n\to\infty} \frac{1}{n} \log M^*_{\mathrm{SI-ED}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon). \tag{5.68}$$

Their common value is the $\varepsilon$-capacity [TB], defined as

$$C_\varepsilon(W, P_{H^r}) := \sup\big\{ \xi \in \mathbb{R} : F(\xi; \mathrm{snr}, P_{H^r}) \le \varepsilon \big\}, \tag{5.69}$$

where the outage function is defined as

$$F(\xi; \mathrm{snr}, P_{H^r}) := \Pr\big( C(\mathrm{snr} \|H^r\|_2^2) < \xi \big). \tag{5.70}$$

Observe that for a fixed value of $H^r = \mathbf{h}$ (i.e., the channel state is not random), the expression $C(\mathrm{snr}\|\mathbf{h}\|_2^2)$
is simply the Shannon capacity of the channel. Beyond the first-order characterization, what are the refined
asymptotics of $\log M^*_{\mathrm{no-SI}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon)$ and $\log M^*_{\mathrm{SI-ED}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon)$? The following surprising result
was proved by Yang-Durisi-Koch-Polyanskiy [186].


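Theorem 5.5. Assume that the random variable $G = \|H^r\|_2^2$ has a pdf that is twice continuously differentiable
and that $C_\varepsilon(W, P_{H^r})$ in (5.69) is a point of growth of the outage function defined in (5.70), i.e.,
$F'(C_\varepsilon(W, P_{H^r}); \mathrm{snr}, P_{H^r}) > 0$. Then

$$\log M^*_{\mathrm{no-SI}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon) = n C_\varepsilon(W, P_{H^r}) + O(\log n), \quad \text{and} \tag{5.71}$$

$$\log M^*_{\mathrm{SI-ED}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon) = n C_\varepsilon(W, P_{H^r}) + O(\log n). \tag{5.72}$$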


The condition on the channel gain G is satisfied by many fading models of interest, including Rayleigh, 
Rician and Nakagami. 
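
To make the definitions (5.69) and (5.70) concrete, the following minimal sketch evaluates the outage function and the $\varepsilon$-capacity in closed form. It is only an illustration under stated assumptions: the gain $G = \|H^r\|_2^2$ is assumed to be exponentially distributed with unit mean, rates are measured in nats, and the function names are hypothetical and do not come from [186].

```python
import math


def gaussian_capacity(snr: float) -> float:
    """C(snr) = 0.5 * log(1 + snr), in nats per channel use."""
    return 0.5 * math.log1p(snr)


def outage_probability(xi: float, snr: float) -> float:
    """F(xi; snr) = Pr(C(snr * G) < xi), assuming G ~ Exp(1) (illustrative choice)."""
    if xi <= 0.0:
        return 0.0
    # C(snr * g) < xi  <=>  g < (exp(2 * xi) - 1) / snr
    g_threshold = math.expm1(2.0 * xi) / snr
    return 1.0 - math.exp(-g_threshold)


def epsilon_capacity(eps: float, snr: float) -> float:
    """sup{xi : F(xi; snr) <= eps}, obtained by inverting the Exp(1) cdf."""
    g_eps = -math.log1p(-eps)  # eps-quantile of the assumed gain distribution
    return gaussian_capacity(snr * g_eps)


if __name__ == "__main__":
    snr, eps = 10.0, 0.1
    c_eps = epsilon_capacity(eps, snr)
    print(f"C_eps = {c_eps:.4f} nats")
    print(f"outage at C_eps = {outage_probability(c_eps, snr):.4f}")
```

Because the assumed gain has a continuous, strictly increasing cdf, the outage probability evaluated at the $\varepsilon$-capacity equals $\varepsilon$, which is the "point of growth" situation covered by Theorem 5.5.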

Theorem 5.5 says, interestingly, that in the quasi-static setting, the $\Theta(\sqrt{n})$ dispersion terms that we
usually see in asymptotic expansions are absent. This means that the $\varepsilon$-capacity is a good benchmark for the
finite blocklength fundamental limits $\log M^*_{\mathrm{no-SI}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon)$ and $\log M^*_{\mathrm{SI-ED}}(W^n, P_{H^r}, \mathrm{snr}, \varepsilon)$, since the
backoff from the $\varepsilon$-capacity is of the order $O(\frac{\log n}{n})$ and not the larger $O(\frac{1}{\sqrt{n}})$.

We will not detail the proof of Theorem 5.5 here, as it is rather involved. See [186] for the details.
However, we will provide a plausibility argument as to why the $\Theta(\sqrt{n})$ term is absent in the expansions in
(5.71)-(5.72). Since the quasi-static fading channel is conditionally ergodic (meaning that given $H^r = \mathbf{h}$, it
is ergodic), one has that

$$\varepsilon^*(W^n, \mathbf{h}, \mathrm{snr}, M) \approx \Pr\Big( nC(\mathrm{snr}\|\mathbf{h}\|_2^2) + \sqrt{nV(\mathrm{snr}\|\mathbf{h}\|_2^2)}\, Z \le \log M \Big) \tag{5.73}$$

where $\varepsilon^*(W^n, \mathbf{h}, \mathrm{snr}, M)$ is the smallest error probability with $M$ codewords and channel gains $H^r = \mathbf{h}$, and $Z$
is the standard normal random variable. Note that $C(\mathrm{snr}\|\mathbf{h}\|_2^2)$ and $V(\mathrm{snr}\|\mathbf{h}\|_2^2)$ are respectively the capacity
and dispersion of the channels conditioned on $H^r = \mathbf{h}$. If $Z$ is independent of $H^r$, the above probability is
close to one in the "outage case", i.e., when $nC(\mathrm{snr}\|\mathbf{h}\|_2^2) < \log M$, and close to zero otherwise, because the
order-$\sqrt{n}$ fluctuation is dominated by the order-$n$ gap. Hence, taking the expectation over $H^r$,

$$\varepsilon^*(W^n, P_{H^r}, \mathrm{snr}, M) \approx \Pr\big( nC(\mathrm{snr}\|H^r\|_2^2) \le \log M \big), \tag{5.74}$$

where $\varepsilon^*(W^n, P_{H^r}, \mathrm{snr}, M)$ is the smallest error probability with $M$ codewords and random channel gains. In
fact, the above argument can be formalized using the following lemma whose proof can be found in [186].


Lemma 5.3. Let $A$ be a random variable with zero mean, unit variance and finite third moment. Let $B$ be
independent of $A$ with twice continuously differentiable pdf. Then,

$$\Pr\big(A \le \sqrt{n}\, B\big) = \Pr(B \ge 0) + O\Big(\frac{1}{n}\Big). \tag{5.75}$$

The approximation in (5.74) is then justified by taking

$$A = \sqrt{V(\mathrm{snr}\|H^r\|_2^2)}\, Z, \quad \text{and} \quad B = \log M - nC(\mathrm{snr}\|H^r\|_2^2). \tag{5.76}$$
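
The statement of Lemma 5.3 can be checked numerically in one special case where everything is available in closed form. The sketch below is an illustration only: it assumes $A \sim \mathcal{N}(0,1)$ and $B \sim \mathcal{N}(\mu, 1)$ with $\mu = 0.5$ (both choices are hypothetical and satisfy the hypotheses of the lemma), so that $A - \sqrt{n} B$ is Gaussian and both probabilities reduce to evaluations of the standard normal cdf.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gap(n: int, mu: float = 0.5) -> float:
    """Pr(A <= sqrt(n) B) - Pr(B >= 0) for independent A ~ N(0,1), B ~ N(mu,1)."""
    # A - sqrt(n) * B is Gaussian with mean -sqrt(n) * mu and variance 1 + n,
    # so Pr(A - sqrt(n) * B <= 0) = Phi(sqrt(n) * mu / sqrt(1 + n)).
    lhs = STD_NORMAL.cdf(math.sqrt(n) * mu / math.sqrt(1.0 + n))
    rhs = STD_NORMAL.cdf(mu)  # Pr(B >= 0) = Phi(mu)
    return lhs - rhs


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        # The product n * gap(n) should stay bounded if the gap is O(1/n).
        print(n, gap(n), n * gap(n))
```

In this Gaussian example the product of $n$ and the gap stays bounded as $n$ grows, in line with the $O(1/n)$ remainder; this is the mechanism that removes the $\sqrt{n}$ term from the expansions in Theorem 5.5.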


Finally, we remark that this quasi-static SIMO model is different from that in Section H in two significant 
ways: First, the state here is a continuous random variable and second, according to (5.66), the quasi-static
scenario here implies that the state $H^r$ is constant throughout transmission and does not vary across time
$i = 1, \ldots, n$. This explains the difference in second-order behavior vis-a-vis the result in Theorem 5.2. The
distinction between this model and that in Section 5.4 on mixed channels with finitely many states is that
the fading coefficients contained in $H^r$ are continuous random variables.
Chapter 6

Distributed Lossless Source Coding 


It is not an exaggeration to say that one of the most surprising results in network information theory is
the theorem by Slepian and Wolf [151] concerning distributed lossless source coding. For the lossless source
coding problem as discussed extensively in Chapter 3, it can be easily seen that if we would like to losslessly
and reliably reconstruct $X^n$ from its compressed version and correlated side-information $Y^n$ that is available
to both encoder and decoder, then the minimum rate of compression is $H(X|Y)$. What happens if the
side information is only available to the decoder but not the encoder? Surprisingly, the minimum rate of
compression is still $H(X|Y)$! It hints at the encoder being able to perform some form of universal encoding
regardless of the nature of whatever side-information is available to the decoder.

A more general version of this problem is shown in Fig. 6.1. Here, two correlated sources are to be
losslessly reconstructed in a distributed fashion. That is, encoder 1 sees $X_1$ and not $X_2$, and vice versa.
Slepian and Wolf showed in [151] that if $X_1^n$ and $X_2^n$ are generated from a discrete memoryless multiple
source (DMMS) $P_{X_1^n X_2^n}$, then the set of achievable rate pairs $(R_1, R_2)$ is the set of pairs satisfying

$$R_1 \ge H(X_1|X_2), \quad R_2 \ge H(X_2|X_1), \quad R_1 + R_2 \ge H(X_1, X_2). \tag{6.1}$$


In this chapter, we analyze refinements to Slepian and Wolf's seminal result. Essentially, we fix a point
$(R_1^*, R_2^*)$ on the boundary of the region in (6.1). We then find all possible second-order coding rate pairs
$(L_1, L_2) \in \mathbb{R}^2$ such that there exist length-$n$ block codes of sizes $M_{jn}$, $j = 1, 2$ and error probabilities $\varepsilon_n$
such that

$$\log M_{jn} \le n R_j^* + \sqrt{n} L_j + o(\sqrt{n}), \quad \text{and} \quad \varepsilon_n \le \varepsilon + o(1). \tag{6.2}$$

The latter condition means that the sequence of codes is $\varepsilon$-reliable. We will see that if $(R_1^*, R_2^*)$ is a corner
point, the set of all such $(L_1, L_2)$ is characterized in terms of a multivariate Gaussian cdf. This is the
distinguishing feature compared to results in the previous chapters.

The material in this chapter is based on the work by Nomura and Han [118] and Tan and Kosut [157].


6.1 Definitions and Non-Asymptotic Bounds 


In this section, we set up the distributed lossless source coding problem formally and mention some known
non-asymptotic bounds. Let $P_{X_1X_2} \in \mathcal{P}(\mathcal{X}_1 \times \mathcal{X}_2)$ be a correlated source. See Fig. 6.1.

An $(M_1, M_2, \varepsilon)$-code for the correlated source $P_{X_1X_2} \in \mathcal{P}(\mathcal{X}_1 \times \mathcal{X}_2)$ consists of a triplet of maps that
includes two encoders $f_j : \mathcal{X}_j \to \{1, \ldots, M_j\}$ for $j = 1, 2$ and a decoder $\varphi : \{1, \ldots, M_1\} \times \{1, \ldots, M_2\} \to
\mathcal{X}_1 \times \mathcal{X}_2$ such that the error probability satisfies

$$P_{X_1X_2}\big( \{ (x_1, x_2) \in \mathcal{X}_1 \times \mathcal{X}_2 : \varphi(f_1(x_1), f_2(x_2)) \ne (x_1, x_2) \} \big) \le \varepsilon. \tag{6.3}$$

The numbers $M_1$ and $M_2$ are called the sizes of the code.

Figure 6.1: Illustration of the Slepian-Wolf [151] problem.


We now state known achievability and converse bounds due to Miyake and Kanaya m- See Theorems 
7.2.1 and 7.2.2 in Han’s book m for the proofs of these results. The achievability bound is based on Cover’s 
random binning [32] idea.

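Proposition 6.1 (Achievability Bound for Slepian-Wolf problem). For every $\gamma > 0$, there exists an $(M_1, M_2, \varepsilon)$-code
satisfying

$$\varepsilon \le \Pr\bigg( \log \frac{1}{P_{X_1|X_2}(X_1|X_2)} \ge \log M_1 - \gamma \ \text{ or } \ \log \frac{1}{P_{X_2|X_1}(X_2|X_1)} \ge \log M_2 - \gamma \ \text{ or } \ \log \frac{1}{P_{X_1X_2}(X_1, X_2)} \ge \log(M_1 M_2) - \gamma \bigg) + 3\exp(-\gamma). \tag{6.4}$$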


The converse bound is based on standard techniques in information spectrum EH Ch. 7] analysis. 

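Proposition 6.2 (Converse Bound for Slepian-Wolf problem). For any $\gamma > 0$, every $(M_1, M_2, \varepsilon)$-code must
satisfy

$$\varepsilon \ge \Pr\bigg( \log \frac{1}{P_{X_1|X_2}(X_1|X_2)} \ge \log M_1 + \gamma \ \text{ or } \ \log \frac{1}{P_{X_2|X_1}(X_2|X_1)} \ge \log M_2 + \gamma \ \text{ or } \ \log \frac{1}{P_{X_1X_2}(X_1, X_2)} \ge \log(M_1 M_2) + \gamma \bigg) - 3\exp(-\gamma). \tag{6.5}$$

Notice that the entropy density vector

$$\mathbf{h}_{X_1X_2}(x_1, x_2) := \begin{bmatrix} \log \frac{1}{P_{X_1|X_2}(x_1|x_2)} \\ \log \frac{1}{P_{X_2|X_1}(x_2|x_1)} \\ \log \frac{1}{P_{X_1X_2}(x_1, x_2)} \end{bmatrix} \tag{6.6}$$

plays a prominent role in both the direct and converse bounds.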
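
As a small illustration of the entropy density vector in (6.6), the sketch below evaluates it for a toy binary source and checks that its expectation recovers the three entropies that define the Slepian-Wolf region in (6.1). The pmf, the use of base-2 logarithms and the function names are assumptions made for this example only.

```python
import math

# Illustrative joint pmf P_{X1 X2} on {0,1}^2 (an assumption for this sketch).
PMF = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}


def marginals(pmf):
    """Return the marginal pmfs of X1 and X2 as dictionaries."""
    p1, p2 = {}, {}
    for (x1, x2), p in pmf.items():
        p1[x1] = p1.get(x1, 0.0) + p
        p2[x2] = p2.get(x2, 0.0) + p
    return p1, p2


def entropy_density(pmf, x1, x2):
    """h(x1,x2) = (log 1/P(x1|x2), log 1/P(x2|x1), log 1/P(x1,x2)) in bits."""
    p1, p2 = marginals(pmf)
    p = pmf[(x1, x2)]
    return (-math.log2(p / p2[x2]), -math.log2(p / p1[x1]), -math.log2(p))


def mean_entropy_density(pmf):
    """E[h(X1,X2)] = (H(X1|X2), H(X2|X1), H(X1,X2))."""
    mean = [0.0, 0.0, 0.0]
    for (x1, x2), p in pmf.items():
        h = entropy_density(pmf, x1, x2)
        for k in range(3):
            mean[k] += p * h[k]
    return tuple(mean)


if __name__ == "__main__":
    h1_given_2, h2_given_1, h12 = mean_entropy_density(PMF)
    print("H(X1|X2) =", h1_given_2)
    print("H(X2|X1) =", h2_given_1)
    print("H(X1,X2) =", h12)
```

The two corner points of the region in (6.1) are then $(H(X_1|X_2), H(X_2))$ and $(H(X_1), H(X_2|X_1))$, with $H(X_2) = H(X_1,X_2) - H(X_1|X_2)$.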


6.2 Second-Order Asymptotics 

We would like to make concrete statements about the performance of optimal codes with asymptotic error
probabilities not exceeding $\varepsilon$ and blocklength $n$ tending to infinity. For this purpose, we assume that the
source $P_{X_1X_2}$ is a DMMS, i.e.,

$$P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) = \prod_{i=1}^n P_{X_1X_2}(x_{1i}, x_{2i}), \quad \forall\, (\mathbf{x}_1, \mathbf{x}_2) \in \mathcal{X}_1^n \times \mathcal{X}_2^n. \tag{6.7}$$

As such, the alphabets $\mathcal{X}_j$, $j = 1, 2$ in the definition of an $(M_1, M_2, \varepsilon)$-code are replaced by their $n$-fold
Cartesian products.


6.2.1 Definition of the Second-Order Rate Region and Remarks 


Unlike the point-to-point problems where the first-order fundamental limit is a single number (e.g., capacity
for channel coding, rate-distortion function for lossy compression), for multi-terminal problems like the
Slepian-Wolf problem there is a continuum of first-order fundamental limits. Hence, to define second-order
quantities, we must "center" the analysis at a point $(R_1^*, R_2^*)$ on the boundary of the optimal rate region
(in source coding scenarios) or capacity region (in channel coding settings). Subsequently, we can ask what
is the local second-order behavior of the system in the vicinity of $(R_1^*, R_2^*)$. This is the essence of second-
order asymptotics for multi-terminal problems. Note that for multi-terminal problems, we exclusively study
second-order asymptotics, and we do not go beyond this to study third-order asymptotics.

Fix a rate pair $(R_1^*, R_2^*)$ on the boundary of the optimal rate region given by (6.1). Let $(L_1, L_2) \in \mathbb{R}^2$
be called an achievable $(\varepsilon, R_1^*, R_2^*)$-second-order coding rate pair if there exists a sequence of $(M_{1n}, M_{2n}, \varepsilon_n)$-
codes for the correlated source $P_{X_1^n X_2^n}$ such that the sequence of error probabilities does not exceed $\varepsilon$
asymptotically, i.e.,

$$\limsup_{n\to\infty} \varepsilon_n \le \varepsilon \tag{6.8}$$

and furthermore, the sizes of the codes satisfy

$$\limsup_{n\to\infty} \frac{1}{\sqrt{n}} \big( \log M_{jn} - n R_j^* \big) \le L_j, \quad j = 1, 2. \tag{6.9}$$

The set of all achievable $(\varepsilon, R_1^*, R_2^*)$-second-order coding rate pairs is denoted as $\mathcal{L}(\varepsilon; R_1^*, R_2^*) \subset \mathbb{R}^2$, the
second-order coding rate region. Note that even though we term the elements of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ as "rates", they
could be negative. This convention follows that in Hayashi's works [75, 76]. The number $L_j$ has units of bits
per square-root source symbol.

Let us pause for a moment to understand the above definition as it is a recurring theme in subsequent
chapters on second-order asymptotics in network information theory. Slepian-Wolf [151] showed that there
exists a sequence of codes for the (stationary, memoryless) correlated source $(X_1, X_2)$ whose error probabilities
vanish asymptotically (i.e., $\varepsilon_n = o(1)$) and whose sizes $M_{jn}$ satisfy

$$\limsup_{n\to\infty} \frac{1}{n} \log M_{jn} \le R_j, \quad j = 1, 2 \tag{6.10}$$

where the rates $R_1$ and $R_2$ satisfy the bounds in (6.1). Hence, the definition of a second-order coding rate
pair in (6.9) is a refinement of the scaling of the code sizes in Slepian-Wolf's setting, centering the rate
analysis at $(R_1^*, R_2^*)$, and analyzing deviations of order $\Theta(\frac{1}{\sqrt{n}})$ from this first-order fundamental limit. In
doing so, we allow the error probability to be non-vanishing per (6.8). This requirement is subtly different
from that in the chapters on source and channel coding where we are interested in approximating non-
asymptotic fundamental limits like $\log M^*(P, \varepsilon)$ or $\log M^*_{\mathrm{ave}}(W, \varepsilon)$ and therein, the error probabilities are
constrained to be no larger than a non-vanishing $\varepsilon \in (0,1)$ for all blocklengths. Here we allow some slack
(i.e., $\varepsilon_n \le \varepsilon + o(1)$). This turns out to be immaterial from the perspective of second-order asymptotics,
as we are seeking to characterize a region of second-order rates $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ and we are not attempting
to characterize higher-order (i.e., third-order) terms in an asymptotic expansion. The $o(1)$ slack affects
the third-order asymptotics but since we are not interested in this study for network information theory
problems, we find it convenient to define $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ using (6.8)-(6.9), analogous to information spectrum
analysis EH- 

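Before we state the main result of this chapter, let us consider the following bivariate generalization of
the cdf of a Gaussian:

$$\Psi(t_1, t_2; \boldsymbol{\mu}, \boldsymbol{\Sigma}) := \int_{-\infty}^{t_1} \int_{-\infty}^{t_2} \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}; \boldsymbol{\Sigma}) \, \mathrm{d}\mathbf{x}, \tag{6.11}$$

where $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}; \boldsymbol{\Sigma})$ is the pdf of a bivariate Gaussian, defined in (1.41). Also define the source dispersion
matrix

$$\mathbf{V} = \mathbf{V}(P_{X_1X_2}) := \mathrm{Cov}\big[ \mathbf{h}(X_1, X_2) \big] \tag{6.12}$$

$$= \begin{bmatrix} V_{1|2} & \rho_{1,2}\sqrt{V_{1|2} V_{2|1}} & \rho_{1,12}\sqrt{V_{1|2} V_{1,2}} \\ \rho_{1,2}\sqrt{V_{1|2} V_{2|1}} & V_{2|1} & \rho_{2,12}\sqrt{V_{2|1} V_{1,2}} \\ \rho_{1,12}\sqrt{V_{1|2} V_{1,2}} & \rho_{2,12}\sqrt{V_{2|1} V_{1,2}} & V_{1,2} \end{bmatrix}. \tag{6.13}$$

We also denote the diagonal entries as $V(X_1|X_2) = V_{1|2}$, $V(X_2|X_1) = V_{2|1}$ and $V(X_1, X_2) = V_{1,2}$. Define
$\mathbf{V}_{1,12}$ (resp. $\mathbf{V}_{2,12}$) as the $2 \times 2$ submatrix indexed by the 1st (resp. 2nd) and 3rd entries of $\mathbf{V}$, i.e.,

$$\mathbf{V}_{1,12} := \begin{bmatrix} V_{1|2} & \rho_{1,12}\sqrt{V_{1|2} V_{1,2}} \\ \rho_{1,12}\sqrt{V_{1|2} V_{1,2}} & V_{1,2} \end{bmatrix} \tag{6.14}$$

and $\mathbf{V}_{2,12}$ is defined similarly.

Figure 6.2: Illustration of the different cases in Theorem 6.1 where $H_1 = H(X_1)$ and $H_{2|1} = H(X_2|X_1)$ etc.
The curve is a schematic of the boundary of the set of rate pairs $(R_1, R_2)$ achievable at blocklength $n$ with
error probability no more than $\varepsilon < \frac{1}{2}$. The set is denoted by $\mathcal{R}_{\mathrm{SW}}(n, \varepsilon)$.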
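
The dispersion matrix in (6.12)-(6.14) is simply the covariance of the entropy density vector, so it can be computed directly from the pmf. The following sketch does this for a toy binary source; the pmf, the base-2 logarithms and all function names are assumptions for illustration, not part of the original development.

```python
import math

# Illustrative joint pmf on {0,1}^2; any strictly positive pmf could be used.
PMF = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}


def entropy_density(pmf, x1, x2):
    """h(x1,x2) = (log 1/P(x1|x2), log 1/P(x2|x1), log 1/P(x1,x2)) in bits."""
    p1 = sum(p for (a, _), p in pmf.items() if a == x1)  # P_{X1}(x1)
    p2 = sum(p for (_, b), p in pmf.items() if b == x2)  # P_{X2}(x2)
    p = pmf[(x1, x2)]
    return (-math.log2(p / p2), -math.log2(p / p1), -math.log2(p))


def dispersion_matrix(pmf):
    """V = Cov[h(X1, X2)] as a 3x3 list of lists (units: bits squared)."""
    pts = [(p, entropy_density(pmf, x1, x2)) for (x1, x2), p in pmf.items()]
    mean = [sum(p * h[k] for p, h in pts) for k in range(3)]
    return [[sum(p * (h[a] - mean[a]) * (h[b] - mean[b]) for p, h in pts)
             for b in range(3)] for a in range(3)]


def submatrix(v, idx):
    """Principal submatrix of v on the index list idx, e.g. [0, 2] for V_{1,12}."""
    return [[v[a][b] for b in idx] for a in idx]


if __name__ == "__main__":
    V = dispersion_matrix(PMF)
    for row in V:
        print(["%.4f" % entry for entry in row])
    print("V_{1,12} =", submatrix(V, [0, 2]))
```

For this particular pmf the submatrix $\mathbf{V}_{1,12}$ has a strictly positive determinant, so the bivariate Gaussian that appears at the corner point is non-degenerate.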


6.2.2 Main Result: Second-Order Coding Rate Region 

The set $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ is characterized in the following result. This result was proved by Nomura and Han [118].
A slightly different form of this result was proved earlier by Tan and Kosut [157].

Theorem 6.1. Assume $\mathbf{V}$ is positive definite. Depending on $(R_1^*, R_2^*)$ (see Fig. 6.2), there are 5 cases of
which we state 3 explicitly:

Case (i): $R_1^* = H(X_1|X_2)$ and $R_2^* > H(X_2)$ (vertical boundary)

$$\mathcal{L}(\varepsilon; R_1^*, R_2^*) = \Big\{ (L_1, L_2) : L_1 \ge \sqrt{V(X_1|X_2)}\, \Phi^{-1}(1 - \varepsilon) \Big\}. \tag{6.15}$$

Case (ii): $R_1^* + R_2^* = H(X_1, X_2)$ and $H(X_1|X_2) < R_1^* < H(X_1)$ (diagonal face)

$$\mathcal{L}(\varepsilon; R_1^*, R_2^*) = \Big\{ (L_1, L_2) : L_1 + L_2 \ge \sqrt{V(X_1, X_2)}\, \Phi^{-1}(1 - \varepsilon) \Big\}. \tag{6.16}$$

Case (iii): $R_1^* = H(X_1|X_2)$ and $R_2^* = H(X_2)$ (top-left corner point)

$$\mathcal{L}(\varepsilon; R_1^*, R_2^*) = \Big\{ (L_1, L_2) : \Psi(L_1, L_1 + L_2; \mathbf{0}, \mathbf{V}_{1,12}) \ge 1 - \varepsilon \Big\}. \tag{6.17}$$
Figure 6.3: Illustration of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ in (6.17) for the source $P_{X_1X_2}$ in (6.18) with $\varepsilon = 0.01, 0.1$. The
regions are to the top right of the boundaries indicated.

The region $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ for Case (iii) is illustrated in Fig. 6.3 for a binary source $(X_1, X_2)$ with distribution

$$P_{X_1X_2}(x_1, x_2) = \begin{bmatrix} 0.7 & 0.1 \\ 0.1 & 0.1 \end{bmatrix}. \tag{6.18}$$


Note that $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ for other points on the boundary can be found by symmetry. For example, for the
horizontal boundary, simply interchange the indices 1 and 2 in (6.15). The case in which $\mathbf{V}$ is not positive
definite was dealt with in detail in [157].
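
The three cases of Theorem 6.1 are straightforward to evaluate numerically once the relevant dispersions are known. The sketch below is a hedged illustration: it computes the threshold in (6.15) and tests membership in the corner-point region (6.17) by evaluating the bivariate Gaussian cdf with a one-dimensional Simpson rule. The covariance entries used in the demo are hypothetical numbers, the function names are not from the source, and the integration routine is a simple stand-in for a library-grade bivariate normal cdf.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def case_i_threshold(v_1_given_2: float, eps: float) -> float:
    """Smallest L1 allowed by (6.15): sqrt(V(X1|X2)) * Phi^{-1}(1 - eps)."""
    return math.sqrt(v_1_given_2) * PHI.inv_cdf(1.0 - eps)


def bivariate_cdf(t1, t2, cov, steps=4000):
    """Psi(t1, t2; 0, cov) for a positive definite 2x2 covariance [[v1, c], [c, v2]].

    Uses Pr(Z1 <= t1, Z2 <= t2) = int_{-inf}^{t1} f_{Z1}(x) Pr(Z2 <= t2 | Z1 = x) dx,
    where Z2 | Z1 = x is Gaussian with mean (c / v1) x and variance v2 - c^2 / v1.
    """
    (v1, c), (_, v2) = cov
    s1 = math.sqrt(v1)
    s2_cond = math.sqrt(v2 - c * c / v1)
    lo = -10.0 * s1  # truncation point; the neglected tail mass is negligible
    if t1 <= lo:
        return 0.0

    def integrand(x):
        pdf = math.exp(-x * x / (2.0 * v1)) / (s1 * math.sqrt(2.0 * math.pi))
        return pdf * PHI.cdf((t2 - (c / v1) * x) / s2_cond)

    h = (t1 - lo) / steps  # steps must be even for Simpson's rule
    total = integrand(lo) + integrand(t1)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3.0


def in_corner_region(l1, l2, v_1_12, eps):
    """Membership test for (6.17): Psi(L1, L1 + L2; 0, V_{1,12}) >= 1 - eps."""
    return bivariate_cdf(l1, l1 + l2, v_1_12) >= 1.0 - eps


if __name__ == "__main__":
    cov = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical V_{1,12}
    print("Case (i) threshold:", case_i_threshold(cov[0][0], 0.1))
    print("(2.0, 1.0) in corner region:", in_corner_region(2.0, 1.0, cov, 0.1))
```

Note that at the corner point the admissible $L_1$ must exceed the Case (i) threshold, since $\Psi(L_1, L_1 + L_2; \mathbf{0}, \mathbf{V}_{1,12}) \le \Phi(L_1 / \sqrt{V(X_1|X_2)})$; the bivariate constraint is therefore strictly more demanding than either univariate one.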


6.2.3 Proof of Main Result and Remarks 


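Proof. The proof of the direct part specializes the non-asymptotic bound in Proposition 6.1 with the choice
$\gamma = n^{1/4}$. Choose code sizes $M_{1n}$ and $M_{2n}$ to be the smallest integers satisfying

$$\log M_{jn} \ge n R_j^* + \sqrt{n} L_j + 2 n^{1/4}, \quad j = 1, 2, \tag{6.19}$$

for some $(L_1, L_2) \in \mathbb{R}^2$. Substitute these choices into the probability in (6.4), denoted as $p$. The
complementary probability $1 - p$ is

$$1 - p = \Pr\left( \mathbf{h}_{X_1^n X_2^n}(X_1^n, X_2^n) < \begin{bmatrix} n R_1^* + \sqrt{n} L_1 + n^{1/4} \\ n R_2^* + \sqrt{n} L_2 + n^{1/4} \\ n(R_1^* + R_2^*) + \sqrt{n}(L_1 + L_2) + 3 n^{1/4} \end{bmatrix} \right). \tag{6.20}$$

Recall that $\mathbf{h}_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2)$ is the entropy density in (6.6) and that inequalities (like $<$) are applied element-
wise. The three events in the probability above are

$$\mathcal{A}_1 := \left\{ \frac{1}{n} \log \frac{1}{P_{X_1^n|X_2^n}(X_1^n | X_2^n)} < R_1^* + \frac{L_1}{\sqrt{n}} + n^{-3/4} \right\}, \tag{6.21}$$

$$\mathcal{A}_2 := \left\{ \frac{1}{n} \log \frac{1}{P_{X_2^n|X_1^n}(X_2^n | X_1^n)} < R_2^* + \frac{L_2}{\sqrt{n}} + n^{-3/4} \right\}, \quad \text{and} \tag{6.22}$$

$$\mathcal{A}_{12} := \left\{ \frac{1}{n} \log \frac{1}{P_{X_1^n X_2^n}(X_1^n, X_2^n)} < R_1^* + R_2^* + \frac{L_1 + L_2}{\sqrt{n}} + 3 n^{-3/4} \right\}. \tag{6.23}$$

As such, the probability in (6.20) is $\Pr(\mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_{12})$.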

Let us consider Case (i) in Theorem 6.1. In this case, $R_2^* > H(X_2)$ and $R_1^* + R_2^* > H(X_1, X_2)$. By the
weak law of large numbers, $\Pr(\mathcal{A}_2) \to 1$ and $\Pr(\mathcal{A}_{12}) \to 1$ as $n$ grows. In fact, these probabilities converge
to one exponentially fast. Thus, by the union bound applied to the complements,

$$1 - p \ge \Pr(\mathcal{A}_1) - \exp(-n\xi) \tag{6.24}$$

for some $\xi > 0$. Furthermore, because $R_1^* = H(X_1|X_2)$, $\Pr(\mathcal{A}_1)$ can be estimated using the Berry-Esseen
theorem as

$$\Pr(\mathcal{A}_1) \ge \Phi\bigg( \frac{L_1}{\sqrt{V(X_1|X_2)}} \bigg) - O\Big( \frac{1}{\sqrt{n}} \Big). \tag{6.25}$$

Hence, one has

$$1 - p \ge \Phi\bigg( \frac{L_1}{\sqrt{V(X_1|X_2)}} \bigg) - O\Big( \frac{1}{\sqrt{n}} \Big). \tag{6.26}$$

Coupled with the fact that $\exp(-\gamma) = \exp(-n^{1/4})$, the proof of the direct part of (6.15) is complete. The
converse employs essentially the same technique. Case (ii) is also similar with the exception that now
$\Pr(\mathcal{A}_1) \to 1$ and $\Pr(\mathcal{A}_2) \to 1$, while $\Pr(\mathcal{A}_{12})$ is estimated using the Berry-Esseen theorem.

We are left with Case (iii). In this case, only $\Pr(\mathcal{A}_2) \to 1$. Thus, just as in (6.24), (6.20) can be estimated
as

$$1 - p \ge \Pr(\mathcal{A}_1 \cap \mathcal{A}_{12}) - \exp(-n\xi') \tag{6.27}$$

for some $\xi' > 0$. The probability can now be estimated using the multivariate Berry-Esseen theorem
(Corollary 1.1) as

$$\Pr(\mathcal{A}_1 \cap \mathcal{A}_{12}) \ge \Psi(L_1, L_1 + L_2; \mathbf{0}, \mathbf{V}_{1,12}) - O\Big( \frac{1}{\sqrt{n}} \Big). \tag{6.28}$$

We complete the proof of (6.17) similarly to Case (i). The converse is completely analogous.


□ 


A couple of take-home messages are in order: 

First, consider Case (i). In this case, we are operating "far away" from the constraint concerning the
second rate and the sum rate constraint. This corresponds to the events $\mathcal{A}_2^c$ and $\mathcal{A}_{12}^c$. Thus, by the theory
of large deviations, $\Pr(\mathcal{A}_2^c)$ and $\Pr(\mathcal{A}_{12}^c)$ both tend to zero exponentially fast. Essentially for these two error
events, we are in the error exponents regime.^1 The same holds true for Case (ii).

Second, consider Case (iii). This is the most interesting case for the second-order asymptotics for the
Slepian-Wolf problem. We are operating at a corner point and are far away from the second rate constraint,
i.e., in the error exponents regime for $\mathcal{A}_2^c$. The remaining two events $\mathcal{A}_1$ and $\mathcal{A}_{12}$ are, however, still in the
central limit regime and hence their joint probability must be estimated using the multivariate Berry-Esseen
theorem. Instead of the result being expressible in terms of a univariate Gaussian cdf $\Phi$ (which is the case
for single-terminal problems in Part II of this monograph), the multivariate version of the Gaussian cdf $\Psi$,
parameterized by the (in general, full) covariance matrix $\mathbf{V}_{1,12}$ in (6.14), must be employed. Compared to
the cooperative case where $X_2^n$ (resp. $X_1^n$) is available to encoder 1 (resp. encoder 2), we see from the result
in Case (iii) that Slepian-Wolf coding, in general, incurs a rate-loss over the case where side-information is
available to all terminals. Indeed, when side-information is available at all terminals, the matrix $\mathbf{V}_{1,12}$ that
characterizes $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ in Case (iii) would be diagonal [157], since the source coding problems involving
$X_1$ and $X_2$ are now independent of each other. In other words, in this case, there exists a sequence of codes
with error probabilities satisfying (6.8) and sizes $(M_{1n}, M_{2n})$ satisfying

$$\log M_{1n} \le n H(X_1|X_2) - \sqrt{n V(X_1|X_2)}\, \Phi^{-1}(\varepsilon) + o(\sqrt{n}), \tag{6.29}$$

$$\log M_{2n} \le n H(X_2|X_1) - \sqrt{n V(X_2|X_1)}\, \Phi^{-1}(\varepsilon) + o(\sqrt{n}), \tag{6.30}$$

$$\log(M_{1n} M_{2n}) \le n H(X_1, X_2) - \sqrt{n V(X_1, X_2)}\, \Phi^{-1}(\varepsilon) + o(\sqrt{n}). \tag{6.31}$$

Inequality (6.29) corresponds to the problem of source coding $X_1$ with $X_2$ available as full (non-coded)
side information at the decoder. Inequality (6.30) swaps the role of $X_1$ and $X_2$. Finally, inequality (6.31)
corresponds to lossless source coding of the vector source $(X_1, X_2)$, similarly to the result on lossless source
coding without side information in Section 3.2.


6.3 Second-Order Asymptotics of Slepian-Wolf Coding via the Method of Types


Just as in Section 3.3 (second-order asymptotics of lossless data compression via the method of types), we
can show that codes that do not necessarily have full knowledge of the source statistics (i.e., partially
universal source codes) can achieve the second-order coding rate region $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$. However, the coding
scheme does require the knowledge of the entropies together with the pair of second-order rates $(L_1, L_2)$ we
would like to achieve. We illustrate the achievability proof technique for Case (iii) of Theorem 6.1 in which
$R_1^* = H(X_1|X_2)$ and $R_2^* = H(X_2)$.

The code construction is based on Cover's random binning idea [35] and the decoding strategy is similar
to minimum empirical entropy decoding [35, 133]. Fix $(L_1, L_2) \in \mathcal{L}(\varepsilon; R_1^*, R_2^*)$ where $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ is given
in (6.17). Also fix code sizes $M_{1n}$ and $M_{2n}$ satisfying (6.19). For each $j = 1, 2$, uniformly and independently
assign each sequence $\mathbf{x}_j \in \mathcal{X}_j^n$ into one of $M_{jn}$ bins labeled as $\mathcal{B}_j(m_j)$, $m_j \in \{1, \ldots, M_{jn}\}$. The bin
assignments are revealed to all parties. To send $\mathbf{x}_j \in \mathcal{X}_j^n$, encoder $j$ transmits its bin index $m_j$.


^1 Of course, the error exponents for the Slepian-Wolf problem are known [36, 158] but any exponential bound suffices for our
purposes here.
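The decoder, upon receipt of the bin indices $(m_1, m_2) \in \{1, \ldots, M_{1n}\} \times \{1, \ldots, M_{2n}\}$, finds a pair of
sequences $(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2) \in \mathcal{B}_1(m_1) \times \mathcal{B}_2(m_2)$ satisfying

$$\hat{\mathbf{H}}(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2) := \begin{bmatrix} \hat{H}(\hat{\mathbf{x}}_1 | \hat{\mathbf{x}}_2) \\ \hat{H}(\hat{\mathbf{x}}_2 | \hat{\mathbf{x}}_1) \\ \hat{H}(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2) \end{bmatrix} \le \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_{12} \end{bmatrix} \tag{6.32}$$

for some thresholds $\gamma_1, \gamma_2, \gamma_{12}$ defined as

$$\gamma_1 := H(X_1|X_2) + \frac{L_1}{\sqrt{n}} + n^{-3/4}, \tag{6.33}$$

$$\gamma_2 := H(X_2) + \frac{L_2}{\sqrt{n}} + n^{-3/4}, \tag{6.34}$$

$$\gamma_{12} := H(X_1, X_2) + \frac{L_1 + L_2}{\sqrt{n}} + n^{-3/4}. \tag{6.35}$$

If there is no sequence pair $(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2) \in \mathcal{B}_1(m_1) \times \mathcal{B}_2(m_2)$ satisfying (6.32) or if there is more than one, declare
an error. Note that the thresholds depend on the entropies and $(L_1, L_2)$, hence these values need to be
known to the decoder.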
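
The decoding rule can be sketched in a few lines of code. The following is a toy illustration only: it computes the empirical entropy vector of a pair of sequences from their joint type and declares an error unless exactly one pair in the two bins passes the threshold test. Bins are represented as plain Python lists of tuples, and all names (`empirical_entropies`, `passes_threshold`, `decode`) are hypothetical; a practical decoder would not enumerate bins exhaustively.

```python
import math
from collections import Counter


def empirical_entropies(x1, x2):
    """Return (H(x1|x2), H(x2|x1), H(x1,x2)) in bits for the joint type of (x1, x2)."""
    n = len(x1)

    def entropy(counter):
        return -sum(c / n * math.log2(c / n) for c in counter.values())

    h12 = entropy(Counter(zip(x1, x2)))
    h1 = entropy(Counter(x1))
    h2 = entropy(Counter(x2))
    return (h12 - h2, h12 - h1, h12)


def passes_threshold(x1, x2, gammas):
    """Element-wise test: empirical entropy vector <= (gamma_1, gamma_2, gamma_12)."""
    return all(h <= g for h, g in zip(empirical_entropies(x1, x2), gammas))


def decode(bin1, bin2, gammas):
    """Return the unique pair in bin1 x bin2 passing the test, else None (error)."""
    hits = [(a, b) for a in bin1 for b in bin2 if passes_threshold(a, b, gammas)]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    bin1 = [(0, 0, 1, 1), (0, 1, 0, 1)]   # toy bin for encoder 1
    bin2 = [(0, 0, 1, 1)]                 # toy bin for encoder 2
    print(decode(bin1, bin2, gammas=(0.5, 0.5, 1.5)))
```

Because the test uses only empirical entropies and the thresholds, the decoder needs the entropies and $(L_1, L_2)$ but not the full source distribution, which is the sense in which the scheme is partially universal.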

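Let the generated source sequences be $X_1^n$ and $X_2^n$ and their associated bin indices be $M_1 = M_1(X_1^n)$
and $M_2 = M_2(X_2^n)$ respectively. By symmetry, we may assume that $M_1 = M_2 = 1$. With
$\boldsymbol{\gamma} := (\gamma_1, \gamma_2, \gamma_{12})'$, the error events are as follows:

$$\mathcal{E}_0 := \big\{ \hat{\mathbf{H}}(X_1^n, X_2^n) \not\le \boldsymbol{\gamma} \big\}, \tag{6.36}$$

$$\mathcal{E}_1 := \big\{ \exists\, \tilde{\mathbf{x}}_1 \in \mathcal{B}_1(1) : \tilde{\mathbf{x}}_1 \ne X_1^n, \ \hat{\mathbf{H}}(\tilde{\mathbf{x}}_1, X_2^n) \le \boldsymbol{\gamma} \big\}, \tag{6.37}$$

$$\mathcal{E}_2 := \big\{ \exists\, \tilde{\mathbf{x}}_2 \in \mathcal{B}_2(1) : \tilde{\mathbf{x}}_2 \ne X_2^n, \ \hat{\mathbf{H}}(X_1^n, \tilde{\mathbf{x}}_2) \le \boldsymbol{\gamma} \big\}, \quad \text{and} \tag{6.38}$$

$$\mathcal{E}_{12} := \big\{ \exists\, (\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2) \in \mathcal{B}_1(1) \times \mathcal{B}_2(1) : \tilde{\mathbf{x}}_1 \ne X_1^n, \ \tilde{\mathbf{x}}_2 \ne X_2^n, \ \hat{\mathbf{H}}(\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2) \le \boldsymbol{\gamma} \big\}. \tag{6.39}$$

Let $\mathbf{H}(X_1, X_2) = [H(X_1|X_2), H(X_2|X_1), H(X_1, X_2)]'$. It can be verified that the following central limit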
relation holds m- 


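$$\sqrt{n}\, \big( \hat{\mathbf{H}}(X_1^n, X_2^n) - \mathbf{H}(X_1, X_2) \big) \xrightarrow{\mathrm{d}} \mathcal{N}(\mathbf{0}, \mathbf{V}). \tag{6.40}$$

This is the multi-dimensional analogue of (3.33) for almost lossless source coding. Thus, by the same
argument as that in (6.27)-(6.28) (ignoring the second entry in $\hat{\mathbf{H}}(X_1^n, X_2^n)$ because $R_2^* = H(X_2) > H(X_2|X_1)$),
one has

$$\Pr(\mathcal{E}_0) \le \varepsilon + O(n^{-1/4}). \tag{6.41}$$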


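Furthermore, by using the method of types, we may verify that

$$\begin{aligned}
\Pr(\mathcal{E}_1) &\le \sum_{\mathbf{x}_1, \mathbf{x}_2} P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) \sum_{\tilde{\mathbf{x}}_1 \ne \mathbf{x}_1 : \hat{\mathbf{H}}(\tilde{\mathbf{x}}_1, \mathbf{x}_2) \le \boldsymbol{\gamma}} \Pr\big(\tilde{\mathbf{x}}_1 \in \mathcal{B}_1(1)\big) && (6.42) \\
&\le \sum_{\mathbf{x}_1, \mathbf{x}_2} P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) \sum_{\tilde{\mathbf{x}}_1 \ne \mathbf{x}_1 : \hat{H}(\tilde{\mathbf{x}}_1 | \mathbf{x}_2) \le \gamma_1} \Pr\big(\tilde{\mathbf{x}}_1 \in \mathcal{B}_1(1)\big) && (6.43) \\
&= \sum_{\mathbf{x}_1, \mathbf{x}_2} P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) \sum_{\tilde{\mathbf{x}}_1 \ne \mathbf{x}_1 : \hat{H}(\tilde{\mathbf{x}}_1 | \mathbf{x}_2) \le \gamma_1} \frac{1}{M_{1n}} && (6.44) \\
&\le \sum_{\mathbf{x}_1, \mathbf{x}_2} P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) \sum_{\substack{V \in \mathcal{V}_n(\mathcal{X}_1; P_{\mathbf{x}_2}) : \\ H(V | P_{\mathbf{x}_2}) \le \gamma_1}} \ \sum_{\tilde{\mathbf{x}}_1 \in \mathcal{T}_V(\mathbf{x}_2)} \frac{1}{M_{1n}} && (6.45) \\
&\le \sum_{\mathbf{x}_1, \mathbf{x}_2} P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) \sum_{\substack{V \in \mathcal{V}_n(\mathcal{X}_1; P_{\mathbf{x}_2}) : \\ H(V | P_{\mathbf{x}_2}) \le \gamma_1}} \frac{\exp\big( n H(V | P_{\mathbf{x}_2}) \big)}{M_{1n}} && (6.46) \\
&\le \sum_{\mathbf{x}_1, \mathbf{x}_2} P_{X_1^n X_2^n}(\mathbf{x}_1, \mathbf{x}_2) \sum_{V \in \mathcal{V}_n(\mathcal{X}_1; P_{\mathbf{x}_2})} \frac{\exp(n \gamma_1)}{M_{1n}} && (6.47) \\
&\le (n + 1)^{|\mathcal{X}_1| |\mathcal{X}_2|} \exp(-n^{1/4}), && (6.48)
\end{aligned}$$

where in (6.44) we used the uniformity of the binning, in (6.45) we partitioned the set of sequences $\tilde{\mathbf{x}}_1$ into
conditional types given $\mathbf{x}_2$ and in (6.46), we used the fact that $|\mathcal{T}_V(\mathbf{x}_2)| \le \exp\big(n H(V | P_{\mathbf{x}_2})\big)$ (cf. Lemma 1.2).
Finally, the type counting lemma and the choices of $\gamma_1$ and $M_{1n}$ were used in (6.48). The same calculation
can be performed for $\Pr(\mathcal{E}_2)$ and $\Pr(\mathcal{E}_{12})$. Thus, asymptotically, the error probability is no larger than $\varepsilon$, as
desired.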


6.4 Other Fixed Error Asymptotic Notions 

In the preceding sections, we were solely concerned with the deviations of order $\Theta(\frac{1}{\sqrt{n}})$ away from the
first-order fundamental limit $(R_1^*, R_2^*)$. However, one may also be interested in other metrics that quantify
backoffs from particular first-order fundamental limits. Here we mention three other quantities that have
appeared in the literature.


6.4.1 Weighted Sum-Rate Dispersion 

For constants $\alpha, \beta \ge 0$, the minimum value of $\alpha R_1 + \beta R_2$ for asymptotically achievable $(R_1, R_2)$ is called
the optimal weighted sum-rate. Of particular interest is the case $\alpha = \beta = 1$, corresponding to the standard
sum-rate $R_1 + R_2$, but other cases may be important as well, e.g., if transmitting from encoder 1 is more
costly than transmitting from encoder 2. Because of the polygonal shape of the Slepian-Wolf region in (6.1),
the optimal weighted sum-rate is always achieved at (at least) one of the
two corner points, and the optimal rate is given by

$$R^*_{\mathrm{sum}}(\alpha, \beta) := \begin{cases} \alpha H(X_1|X_2) + \beta H(X_2) & \alpha \ge \beta \\ \alpha H(X_1) + \beta H(X_2|X_1) & \alpha < \beta. \end{cases} \tag{6.49}$$

One can then define $J \in \mathbb{R}$ to be an achievable $(\varepsilon, \alpha, \beta)$-weighted second-order coding rate if there exists a
sequence of $(M_{1n}, M_{2n}, \varepsilon_n)$-codes for the correlated source $P_{X_1^n X_2^n}$ such that the error probability condition
in (6.8) holds and

$$\limsup_{n\to\infty} \frac{1}{\sqrt{n}} \big( \alpha \log M_{1n} + \beta \log M_{2n} - n R^*_{\mathrm{sum}}(\alpha, \beta) \big) \le J. \tag{6.50}$$

In [157], the smallest such $J$, denoted as $J^*(\varepsilon; \alpha, \beta)$, was found using a proof technique similar to that for
Theorem 6.1.
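
The corner-point rule for the optimal weighted sum-rate is easy to evaluate for a concrete source. The sketch below is an illustration under assumptions: the binary pmf is a toy example, entropies are in bits, and the function names are hypothetical rather than taken from [157].

```python
import math

# Illustrative joint pmf on {0,1}^2 (an assumption for this sketch).
PMF = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}


def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def source_entropies(pmf):
    """Return (H(X1), H(X2), H(X1,X2)) in bits."""
    p1, p2 = {}, {}
    for (x1, x2), p in pmf.items():
        p1[x1] = p1.get(x1, 0.0) + p
        p2[x2] = p2.get(x2, 0.0) + p
    return entropy(p1.values()), entropy(p2.values()), entropy(pmf.values())


def optimal_weighted_sum_rate(pmf, alpha, beta):
    """Minimum of alpha*R1 + beta*R2 over the Slepian-Wolf region."""
    h1, h2, h12 = source_entropies(pmf)
    if alpha >= beta:
        # Corner point (H(X1|X2), H(X2)): the costlier encoder 1 uses the smaller rate.
        return alpha * (h12 - h2) + beta * h2
    # Corner point (H(X1), H(X2|X1)).
    return alpha * h1 + beta * (h12 - h1)


if __name__ == "__main__":
    print("sum-rate (alpha = beta = 1):", optimal_weighted_sum_rate(PMF, 1.0, 1.0))
    print("alpha = 2, beta = 1:", optimal_weighted_sum_rate(PMF, 2.0, 1.0))
```

With $\alpha = \beta = 1$ the function returns $H(X_1, X_2)$, as expected, since every point on the diagonal face of the region attains the same sum-rate.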


6.4.2 Dispersion-Angle Pairs 


One can also imagine approaching a point on the boundary $(R_1^*, R_2^*)$ by fixing an angle of approach $\theta \in [0, 2\pi)$.
Let $(F, \theta)$ be called an achievable $(\varepsilon, R_1^*, R_2^*)$-dispersion-angle pair if there exists a sequence of $(M_{1n}, M_{2n}, \varepsilon_n)$-
codes for the correlated source $P_{X_1^n X_2^n}$ such that the error probability condition in (6.8) holds and

$$\limsup_{n\to\infty} \frac{1}{\sqrt{n}} \big( \log M_{1n} - n R_1^* \big) \le \sqrt{F} \cos\theta, \quad \text{and} \tag{6.51}$$

$$\limsup_{n\to\infty} \frac{1}{\sqrt{n}} \big( \log M_{2n} - n R_2^* \big) \le \sqrt{F} \sin\theta. \tag{6.52}$$

Clearly, dispersion-angle pairs $(F, \theta)$ are in one-to-one correspondence with second-order coding rate pairs
$(L_1, L_2)$. The minimum such $F$ for a given $\theta$, denoted as $F^*(\theta, \varepsilon; R_1^*, R_2^*)$, measures the speed of approach
to $(R_1^*, R_2^*)$ at an angle $\theta$. This fundamental quantity $F^*(\theta, \varepsilon; R_1^*, R_2^*)$ was also characterized in [157].
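
The correspondence between dispersion-angle pairs and second-order coding rate pairs is just a change from polar to Cartesian coordinates, $L_1 = \sqrt{F}\cos\theta$ and $L_2 = \sqrt{F}\sin\theta$. A minimal sketch of the two conversions follows; the function names are hypothetical, and the origin $(L_1, L_2) = (0, 0)$ is the one point where the angle is not determined.

```python
import math


def rates_from_dispersion_angle(f: float, theta: float):
    """Map (F, theta) to (L1, L2) = (sqrt(F) cos(theta), sqrt(F) sin(theta))."""
    root = math.sqrt(f)
    return root * math.cos(theta), root * math.sin(theta)


def dispersion_angle_from_rates(l1: float, l2: float):
    """Map (L1, L2) to (F, theta) with F = L1^2 + L2^2 and theta in [0, 2*pi)."""
    return l1 * l1 + l2 * l2, math.atan2(l2, l1) % (2.0 * math.pi)


if __name__ == "__main__":
    f, theta = dispersion_angle_from_rates(-1.0, 2.0)
    print("F =", f, "theta =", theta)
    print("back to (L1, L2):", rates_from_dispersion_angle(f, theta))
```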


6.4.3 Global Approaches 

Authors of early works on second-order asymptotics in multi-terminal systems [541111111156) considered global 
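rate regions, meaning that they were concerned with quantifying the sizes $(M_{1n}, M_{2n})$ of length-$n$ block codes
with error probability not exceeding $\varepsilon$. These sizes are called $(n, \varepsilon)$-achievable. In the Slepian-Wolf context,
a result by Tan-Kosut [156] states that $(M_{1n}, M_{2n})$ are $(n, \varepsilon)$-achievable iff

$$\begin{bmatrix} \log M_{1n} \\ \log M_{2n} \\ \log(M_{1n} M_{2n}) \end{bmatrix} \in \begin{bmatrix} n H(X_1|X_2) \\ n H(X_2|X_1) \\ n H(X_1, X_2) \end{bmatrix} - \sqrt{n}\, \Psi^{-1}(\mathbf{V}, \varepsilon) + O(\log n)\, \mathbf{1} \tag{6.53}$$

where $\Psi^{-1}(\mathbf{V}, \varepsilon)$ is an appropriate generalization of the $\Phi^{-1}$ function and $\mathbf{1}$ is the vector of all ones. The
precise definition of $\Psi^{-1}(\mathbf{V}, \varepsilon)$, given in (8.24) and illustrated in Fig. 8.3, will not be of concern here.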

While statements such as (6.53) are mathematically correct and are reminiscent of asymptotic expansions
in the point-to-point case (cf. that for lossless source coding in (3.14)), they do not provide the complete
picture with regard to the convergence of rate pairs to a fundamental limit, e.g., a corner point of the
Slepian-Wolf region. Indeed, an achievability statement similar to (6.53) holds for the DM-MAC for each
input distribution [841 Hill 11361 1156) and hence the union over all input distributions. However, one of the 
major deficiencies of such a statement is that the $O(\log n)$ third-order term is not uniform in the input
distributions; this poses serious challenges in the interpretation of the result if we consider random coding
using a sequence of input distributions that varies with the blocklength (cf. Chapter 8). Thus, as pointed out
by Haim-Erez-Kochman [65], for multi-user problems, the value of global expansions such as that in (6.53) is
limited, and such expansions can only be regarded as stepping stones to obtain local results (if possible). Indeed, we do this
for the Gaussian MAC with degraded message sets in Chapter 8.

The main takeaway of this section is that one should adopt the local, weighted sum-rate, or dispersion-
angle problem setups to analyze the second-order asymptotics for multi-terminal problems. These setups
are information-theoretic in nature. In particular, operational quantities (such as the set $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ or the
number $F^*(\theta, \varepsilon; R_1^*, R_2^*)$) are defined and then equated to information quantities.
Chapter 7

A Special Class of Gaussian Interference Channels


This chapter presents results on second-order asymptotics for a channel-type network information theory
problem. The problem we consider here is a special case of the two-sender, two-receiver interference channel
(IC) shown in Fig. 7.1. This model is a basic building block in many modern wireless systems, so theoretical
results and insights are of tremendous practical relevance. The IC was first studied by Ahlswede [2] who
established basic bounds on the capacity region. However, the capacity regions for the discrete memoryless
and Gaussian memoryless cases have remained open problems for over 40 years except for some very
special cases. The best known inner bound is due to Han and Kobayashi m- A simplified form of the 
Han-Kobayashi inner bound was presented by Chong-Motani-Garg-El Gamal [26].

Since the determination of the capacity region is formidable, the derivation of conclusive results for the
second-order asymptotics of general memoryless ICs is also beyond us at this point in time. One very special
case in which the capacity region is known is the IC with very strong interference (VSI). In this case, the
intuition is that each receiver can reliably decode the non-intended message which then aids in decoding the
intended message. The capacity region for the discrete memoryless IC with VSI consists of the set of rate
pairs $(R_1, R_2)$ satisfying

$$R_1 \le I(X_1; Y_1 | X_2, Q), \quad \text{and} \quad R_2 \le I(X_2; Y_2 | X_1, Q) \tag{7.1}$$

for some $P_Q$, $P_{X_1|Q}$ and $P_{X_2|Q}$, where $Q$ is known as the time-sharing random variable. In the Gaussian case,
which Carleial [22] studied, the above region can be written more explicitly as

$$R_1 \le C(\mathrm{snr}_1) \quad \text{and} \quad R_2 \le C(\mathrm{snr}_2), \tag{7.2}$$

where $\mathrm{snr}_j$ is the signal-to-noise ratio of the direct channel from sender $j$ to receiver $j$ and the Gaussian
capacity function is defined as $C(\mathrm{snr}) := \frac{1}{2} \log(1 + \mathrm{snr})$. See Fig. 7.2 for an illustration of the capacity region
and the monograph by Shang and Chen BO] for further discussions on Gaussian interference channels.
Carleial's result is surprising because it appears that interference does not reduce the capacity of the constituent
channels since $C(\mathrm{snr}_j)$ is the capacity of the $j$th direct channel. In Carleial's own words [22],

“Very strong interference is as innocuous as no interference at all. ” 
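
The rectangular region in (7.2) is simple enough to evaluate directly. The sketch below is only illustrative: it measures rates in bits, uses hypothetical function names, and takes for granted that the VSI condition holds (the condition on the cross gains is not checked, since it is not stated in this passage).

```python
import math


def gaussian_capacity(snr: float) -> float:
    """C(snr) = 0.5 * log2(1 + snr), in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)


def in_vsi_capacity_region(r1: float, r2: float, snr1: float, snr2: float) -> bool:
    """Check the two rate constraints in (7.2); the VSI condition itself is assumed."""
    return 0.0 <= r1 <= gaussian_capacity(snr1) and 0.0 <= r2 <= gaussian_capacity(snr2)


if __name__ == "__main__":
    snr1, snr2 = 3.0, 15.0
    print("C(snr1) =", gaussian_capacity(snr1), "C(snr2) =", gaussian_capacity(snr2))
    print("(0.9, 1.9) achievable:", in_vsi_capacity_region(0.9, 1.9, snr1, snr2))
```

The region is a rectangle because, under VSI, neither constraint involves the other user's rate; this is the first-order counterpart of the near-independence of the two error events discussed later in the chapter.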


Similarly to the discrete case in (7.1), the (first-order optimal) achievability proof strategy for the Gaussian
case involves first decoding the interference, subtracting it off from the received channel output, and finally,
reliably decoding the intended message. The VSI condition ensures that the rate constraints in (7.2),
representing the requirement for the second decoding steps to succeed, dominate.

In this chapter, we make a slightly stronger assumption compared to that made by Carleial [22]. We 
assume that the inequalities that define the VSI condition are strict; we call this the strictly VSI (SVSI) 
assumption/regime. With this assumption, we are able to derive the second-order asymptotics of this class 
of Gaussian ICs. 

Figure 7.1: Illustration of the interference channel problem. 

Although the main result in this chapter appears to be similar to the Slepian-Wolf case (in Chapter 6), 
there are several take-home messages that differ from the simpler Slepian-Wolf problem. 

1. First, similar to Carleial's observation that for Gaussian ICs with VSI the capacity is not reduced, we 
show that the dispersions are not affected under the SVSI assumption. More precisely, the second-order 
coding rate region (a set similar to that for the Slepian-Wolf problem in Chapter 6) is characterized 
entirely in terms of the dispersions $V(\mathrm{snr}_j)$ of the two direct AWGN channels from encoder $j$ to decoder $j$; 

2. Second, the main result in this chapter suggests that under the SVSI assumption, and in the second-order 
asymptotic setting, the two error events (of incorrectly decoding messages 1 and 2) are almost 
independent; 

3. Third, for the direct part, we demonstrate the utility of an achievability proof technique by 
MolavianJazi-Laneman [112] that is also applicable to our problem of Gaussian ICs with SVSI. This technique is, in 
general, applicable to multi-terminal Gaussian channels. In the asymptotic evaluation of the information 
spectrum bound (Feinstein bound [53]), the problem is "lifted" to higher dimensions to facilitate 
the application of limit theorems for independent random vectors. 

This chapter is based on work by Le, Tan and Motani [103]. 


7.1 Definitions and Non-Asymptotic Bounds 

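Let us now state the Gaussian IC problem. The Gaussian IC is defined by the following input-output relation: 

$$Y_{1i} = g_{11} X_{1i} + g_{12} X_{2i} + Z_{1i}, \tag{7.3}$$
$$Y_{2i} = g_{21} X_{1i} + g_{22} X_{2i} + Z_{2i}, \tag{7.4}$$

where $i = 1,\ldots,n$, $g_{jk}$ is the channel gain from sender $k$ to receiver $j$, and $Z_{1i} \sim \mathcal{N}(0,1)$ and 
$Z_{2i} \sim \mathcal{N}(0,1)$ are independent noise components.^1 Thus, the channel from $(x_1, x_2)$ to $(y_1, y_2)$ is 

$$W(y_1, y_2 | x_1, x_2) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} \left\| \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} - \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right\|_2^2 \right). \tag{7.5}$$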


^1 The independence assumption between $Z_{1i}$ and $Z_{2i}$ was not made in Carleial's work [22] (i.e., $Z_{1i}$ and $Z_{2i}$ may be correlated) 
but we need this assumption for the analyses here. It is well known that the capacity region of any general IC depends only 
on the marginals [49, Ch. 6] but it is, in general, not true that the set of achievable second-order rates $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$, defined in 
(7.17), has the same property. This will become clear in the proof of Theorem 7.1 in the text following (7.32). 

Let $W_1$ and $W_2$ denote the marginals of $W$. The channel also acts in a stationary, memoryless way so 

$$W^n(\mathbf{y}_1, \mathbf{y}_2 | \mathbf{x}_1, \mathbf{x}_2) = \prod_{i=1}^n W(y_{1i}, y_{2i} | x_{1i}, x_{2i}). \tag{7.6}$$

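It will be convenient to make the dependence of the code on the blocklength explicit right away. 
We define an $(n, M_1, M_2, S_1, S_2, \varepsilon)$-code for the Gaussian IC as four maps that consist of two encoders 
$f_j : \{1,\ldots,M_j\} \to \mathbb{R}^n$, $j = 1,2$, and two decoders $\varphi_j : \mathbb{R}^n \to \{1,\ldots,M_j\}$ such that the following power 
constraints^2 are satisfied 

$$\| f_j(m_j) \|_2^2 = \sum_{i=1}^n f_{ji}^2(m_j) \le n S_j, \tag{7.7}$$

and, denoting $\mathcal{D}_{m_1,m_2} := \{(\mathbf{y}_1, \mathbf{y}_2) : \varphi_1(\mathbf{y}_1) = m_1 \text{ and } \varphi_2(\mathbf{y}_2) = m_2\}$ as the decoding region for $(m_1, m_2)$, 
the average error probability satisfies 

$$\frac{1}{M_1 M_2} \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2} W^n\big( \mathbb{R}^n \times \mathbb{R}^n \setminus \mathcal{D}_{m_1,m_2} \,\big|\, f_1(m_1), f_2(m_2) \big) \le \varepsilon. \tag{7.8}$$

In (7.7), $S_1$ and $S_2$ are the admissible powers on the codewords $f_1(m_1)$ and $f_2(m_2)$. The signal-to-noise 
ratios of the direct channels are 

$$\mathrm{snr}_1 := g_{11}^2 S_1, \quad\text{and}\quad \mathrm{snr}_2 := g_{22}^2 S_2. \tag{7.9}$$

The interference-to-noise ratios are 

$$\mathrm{inr}_1 := g_{12}^2 S_2, \quad\text{and}\quad \mathrm{inr}_2 := g_{21}^2 S_1. \tag{7.10}$$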

We say that the Gaussian IC $W$, together with the transmit powers $(S_1, S_2)$, is in the VSI regime if the 
signal- and interference-to-noise ratios satisfy 

$$\mathrm{snr}_1 \le \frac{\mathrm{inr}_2}{1 + \mathrm{snr}_2}, \quad\text{and}\quad \mathrm{snr}_2 \le \frac{\mathrm{inr}_1}{1 + \mathrm{snr}_1}, \tag{7.11}$$

or equivalently, in terms of capacities, 

$$C(\mathrm{snr}_1) + C(\mathrm{snr}_2) \le \min\{ C(\mathrm{snr}_1 + \mathrm{inr}_1),\, C(\mathrm{snr}_2 + \mathrm{inr}_2) \}. \tag{7.12}$$

The Gaussian IC is in the SVSI regime if the inequalities in (7.11)-(7.12) are strict. Intuitively, the VSI (or 
SVSI) assumption means that the cross channel gains $g_{12}$ and $g_{21}$ are much stronger (larger) than the direct 
gains $g_{11}$ and $g_{22}$ for given transmit powers $S_1$ and $S_2$. 
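The equivalence of the ratio form (7.11) and the capacity form (7.12) is easy to check numerically. The following is a minimal sketch, not part of the original text; the function names and the example gains are illustrative assumptions, and rates are measured in nats.

```python
import math


def capacity(snr: float) -> float:
    """Gaussian capacity function C(snr) = 0.5 * log(1 + snr), in nats."""
    return 0.5 * math.log1p(snr)


def snr_inr(g11, g12, g21, g22, s1, s2):
    """Signal- and interference-to-noise ratios as in (7.9)-(7.10)."""
    snr1, snr2 = g11 ** 2 * s1, g22 ** 2 * s2
    inr1, inr2 = g12 ** 2 * s2, g21 ** 2 * s1
    return snr1, snr2, inr1, inr2


def svsi_ratio_form(g11, g12, g21, g22, s1, s2) -> bool:
    """Strict version of the ratio condition (7.11)."""
    snr1, snr2, inr1, inr2 = snr_inr(g11, g12, g21, g22, s1, s2)
    return snr1 < inr2 / (1 + snr2) and snr2 < inr1 / (1 + snr1)


def svsi_margin(g11, g12, g21, g22, s1, s2) -> float:
    """Right side minus left side of (7.12); positive exactly in the SVSI regime."""
    snr1, snr2, inr1, inr2 = snr_inr(g11, g12, g21, g22, s1, s2)
    lhs = capacity(snr1) + capacity(snr2)
    rhs = min(capacity(snr1 + inr1), capacity(snr2 + inr2))
    return rhs - lhs


# Hypothetical example: unit direct gains, cross gains of 3, unit powers.
strong = (1.0, 3.0, 3.0, 1.0, 1.0, 1.0)
# Hypothetical example: all gains equal, so interference is not very strong.
weak = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
```

For the `strong` parameters both forms report the SVSI regime, while for the `weak` parameters both report that the condition fails.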

We now state non-asymptotic bounds that are evaluated asymptotically later. The proofs of these bounds 
are standard. See [SS] or m- 


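Proposition 7.1 (Achievability bound for IC). Fix any input distributions $P_{X_1^n}$ and $P_{X_2^n}$ whose support 
satisfies the power constraints in (7.7), i.e., $\|X_j^n\|_2^2 \le n S_j$ with probability one. For every $n \in \mathbb{N}$, every 
$\gamma > 0$ and for any choice of (conditional) output distributions $Q_{Y_1^n|X_2^n}$, $Q_{Y_2^n|X_1^n}$, $Q_{Y_1^n}$ and $Q_{Y_2^n}$, there exists 
an $(n, M_1, M_2, S_1, S_2, \varepsilon)$-code for the IC such that 

$$\begin{aligned}
\varepsilon \le \Pr\bigg( & \log \frac{W_1^n(Y_1^n | X_1^n, X_2^n)}{Q_{Y_1^n|X_2^n}(Y_1^n | X_2^n)} \le \log M_1 + n\gamma \ \text{ or } \ \log \frac{W_2^n(Y_2^n | X_1^n, X_2^n)}{Q_{Y_2^n|X_1^n}(Y_2^n | X_1^n)} \le \log M_2 + n\gamma \ \text{ or } \\
& \log \frac{W_1^n(Y_1^n | X_1^n, X_2^n)}{Q_{Y_1^n}(Y_1^n)} \le \log(M_1 M_2) + n\gamma \ \text{ or } \ \log \frac{W_2^n(Y_2^n | X_1^n, X_2^n)}{Q_{Y_2^n}(Y_2^n)} \le \log(M_1 M_2) + n\gamma \bigg) + \zeta \exp(-n\gamma),
\end{aligned} \tag{7.13}$$

where $\zeta := \sum_{j=1}^2 \sum_{k=1}^2 \zeta_{jk}$ and 

$$\zeta_{11} := \sup_{\mathbf{x}_2, \mathbf{y}_1} \frac{P_{X_1^n} W_1^n(\mathbf{y}_1 | \mathbf{x}_2)}{Q_{Y_1^n|X_2^n}(\mathbf{y}_1 | \mathbf{x}_2)}, \qquad \zeta_{12} := \sup_{\mathbf{y}_1} \frac{P_{X_1^n} P_{X_2^n} W_1^n(\mathbf{y}_1)}{Q_{Y_1^n}(\mathbf{y}_1)}, \tag{7.14}$$

$$\zeta_{21} := \sup_{\mathbf{x}_1, \mathbf{y}_2} \frac{P_{X_2^n} W_2^n(\mathbf{y}_2 | \mathbf{x}_1)}{Q_{Y_2^n|X_1^n}(\mathbf{y}_2 | \mathbf{x}_1)}, \qquad \zeta_{22} := \sup_{\mathbf{y}_2} \frac{P_{X_1^n} P_{X_2^n} W_2^n(\mathbf{y}_2)}{Q_{Y_2^n}(\mathbf{y}_2)}. \tag{7.15}$$

^2 The notation $f_{ji}(m_j)$ denotes the $i$-th coordinate of the codeword $f_j(m_j) \in \mathbb{R}^n$. 

This is a generalization of the average error version of Feinstein's lemma [53] (Proposition 4.1). Notice 
that we have the freedom to choose the output distributions at the cost of having to control the ratios $\zeta_{jk}$ 
of the induced output distributions and our choice of output distributions. 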

Proposition 7.2 (Converse bound for IC). For every $n \in \mathbb{N}$, every $\gamma > 0$ and for any choice of (conditional) 
output distributions $Q_{Y_1^n|X_2^n}$ and $Q_{Y_2^n|X_1^n}$, every $(n, M_1, M_2, S_1, S_2, \varepsilon)$-code for the IC must satisfy 

$$\varepsilon \ge \Pr\bigg( \log \frac{W_1^n(Y_1^n | X_1^n, X_2^n)}{Q_{Y_1^n|X_2^n}(Y_1^n | X_2^n)} \le \log M_1 - n\gamma \ \text{ or } \ \log \frac{W_2^n(Y_2^n | X_1^n, X_2^n)}{Q_{Y_2^n|X_1^n}(Y_2^n | X_1^n)} \le \log M_2 - n\gamma \bigg) - 2\exp(-n\gamma) \tag{7.16}$$

for some input distributions $P_{X_1^n}$ and $P_{X_2^n}$ whose support satisfies the power constraints in (7.7). 

Observe the following features of the non-asymptotic converse, which is a generalization of the ideas of 
Verdú-Han [169, Lem. 4] and Hayashi-Nagaoka [77, Lem. 4]: First, there are only two error events compared 
to the four in the achievability bound. The SVSI assumption allows us to eliminate two error events in 
the direct bound so the two bounds match in the second-order sense. Second, we are free to choose output 
distributions without any penalty (cf. the achievability bound in Proposition 7.1). Third, the intuition 
behind this bound is in line with the SVSI assumption, namely that decoder 1 knows the codeword $X_2^n$ and 
vice versa. Indeed, the proof of Proposition 7.2 uses this genie-aided idea. 


7.2 Second-Order Asymptotics 

Similar to the study of the second-order asymptotics for the Slepian-Wolf problem, we are interested in 
deviations from the boundary of the capacity region of order $O(1/\sqrt{n})$ for the Gaussian IC under the SVSI 
assumption. This motivates the following definition. 

Let $(R_1^*, R_2^*)$ be a point on the boundary of the capacity region in (7.2). Let $(L_1, L_2) \in \mathbb{R}^2$ be called 
an achievable $(\varepsilon, R_1^*, R_2^*)$-second-order coding rate pair if there exists a sequence of $(n, M_{1n}, M_{2n}, S_1, S_2, \varepsilon_n)$-codes 
for the Gaussian IC such that 

$$\limsup_{n \to \infty} \varepsilon_n \le \varepsilon, \quad\text{and}\quad \liminf_{n \to \infty} \frac{1}{\sqrt{n}} \big( \log M_{jn} - n R_j^* \big) \ge L_j, \tag{7.17}$$

for $j = 1, 2$. The set of all achievable $(\varepsilon, R_1^*, R_2^*)$-second-order coding rate pairs is denoted as $\mathcal{L}(\varepsilon; R_1^*, R_2^*) \subset \mathbb{R}^2$. 
The intuition behind this definition is exactly analogous to the Slepian-Wolf case. 


Define $V_j := V(\mathrm{snr}_j)$ where, recall from (4.79) that 

$$V(\mathrm{snr}) = \log^2 e \cdot \frac{\mathrm{snr}(\mathrm{snr} + 2)}{2(\mathrm{snr} + 1)^2} \tag{7.18}$$

is the Gaussian dispersion function. 

Figure 7.2: Illustration of the different cases in Theorem 7.1. For brevity, we write $C_j = C(\mathrm{snr}_j)$ for $j = 1, 2$. 
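To make the quantities $C(\mathrm{snr})$ and $V(\mathrm{snr})$ concrete, the following minimal sketch evaluates them and the resulting first- plus second-order approximation $nC + \sqrt{nV}\,\Phi^{-1}(\varepsilon)$ of the maximal log-codebook size of a single AWGN channel. It assumes natural logarithms, so the factor $\log^2 e$ in (7.18) equals one; the function names are illustrative, and terms of lower order than $\sqrt{n}$ are deliberately ignored.

```python
import math
from statistics import NormalDist


def capacity(snr: float) -> float:
    """Gaussian capacity C(snr) in nats per channel use."""
    return 0.5 * math.log1p(snr)


def dispersion(snr: float) -> float:
    """Gaussian dispersion V(snr) of (7.18) in nats^2 per channel use."""
    return snr * (snr + 2.0) / (2.0 * (snr + 1.0) ** 2)


def second_order_log_size(n: int, snr: float, eps: float) -> float:
    """Approximation n*C + sqrt(n*V)*Phi^{-1}(eps) of log M (lower-order terms dropped)."""
    return n * capacity(snr) + math.sqrt(n * dispersion(snr)) * NormalDist().inv_cdf(eps)
```

For error probabilities below one half the correction term is negative, which is the "backoff" from capacity at finite blocklength.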


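Theorem 7.1. Let the Gaussian IC $W$, together with the transmit powers $(S_1, S_2)$, be in the SVSI regime. 
Depending on $(R_1^*, R_2^*)$ (see Fig. 7.2), there are 3 different cases: 

Case (i): $R_1^* = C(\mathrm{snr}_1)$ and $R_2^* < C(\mathrm{snr}_2)$ (vertical boundary) 

$$\mathcal{L}(\varepsilon; R_1^*, R_2^*) = \big\{ (L_1, L_2) : L_1 \le \sqrt{V_1}\, \Phi^{-1}(\varepsilon) \big\}. \tag{7.19}$$

Case (ii): $R_1^* < C(\mathrm{snr}_1)$ and $R_2^* = C(\mathrm{snr}_2)$ (horizontal boundary) 

$$\mathcal{L}(\varepsilon; R_1^*, R_2^*) = \big\{ (L_1, L_2) : L_2 \le \sqrt{V_2}\, \Phi^{-1}(\varepsilon) \big\}. \tag{7.20}$$

Case (iii): $R_1^* = C(\mathrm{snr}_1)$ and $R_2^* = C(\mathrm{snr}_2)$ (corner point) 

$$\mathcal{L}(\varepsilon; R_1^*, R_2^*) = \Big\{ (L_1, L_2) : \Phi\Big( -\frac{L_1}{\sqrt{V_1}} \Big) \Phi\Big( -\frac{L_2}{\sqrt{V_2}} \Big) \ge 1 - \varepsilon \Big\}. \tag{7.21}$$

A proof sketch of this result is provided in Section 7.3. The region $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ for Case (iii) is sketched 
in Fig. 7.3 for the symmetric case in which $V_1 = V_2$. 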
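The three cases of Theorem 7.1 can be turned into a simple membership test. The sketch below is only an illustration of (7.19)-(7.21); the function name and the string labels for the cases are assumptions made here, and the dispersions are passed in directly (in nats squared).

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian: cdf is Phi, inv_cdf is Phi^{-1}


def in_second_order_region(l1, l2, v1, v2, eps, case):
    """Check whether (l1, l2) lies in the region of Theorem 7.1 for the given case."""
    if case == "i":    # vertical boundary, (7.19): only user 1 is constrained
        return l1 <= math.sqrt(v1) * _STD_NORMAL.inv_cdf(eps)
    if case == "ii":   # horizontal boundary, (7.20): only user 2 is constrained
        return l2 <= math.sqrt(v2) * _STD_NORMAL.inv_cdf(eps)
    if case == "iii":  # corner point, (7.21): product of two Gaussian cdfs
        p1 = _STD_NORMAL.cdf(-l1 / math.sqrt(v1))
        p2 = _STD_NORMAL.cdf(-l2 / math.sqrt(v2))
        return p1 * p2 >= 1.0 - eps
    raise ValueError("case must be 'i', 'ii' or 'iii'")
```

With $V_1 = V_2 = 0.375$ and $\varepsilon = 0.1$, the pair $(-0.8, -0.8)$ is admissible on the vertical boundary but not at the corner point, where both constraints are active and a larger backoff such as $(-1.1, -1.1)$ is needed.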

A few remarks are in order: First, for Case (i), $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ depends only on $\varepsilon$ and $V_1$. Note that 
$\sqrt{V_1}\,\Phi^{-1}(\varepsilon)$ is the optimum (maximum) second-order coding rate of the AWGN channel (Theorem 4.4) from 
$X_1$ to $Y_1$ when there is no interference, i.e., $g_{12} = 0$ in (7.3). The fact that user 2's parameters do not 
feature in (7.19) is because $R_2^* < C(\mathrm{snr}_2)$. This implies that channel 2 operates in the large-deviations (error 
exponents) regime, so the second constraint in (7.2) does not feature in the second-order analysis, since the 
error probability of decoding message 2 is exponentially small. An analogous observation was also made for 
the Slepian-Wolf problem in Chapter 6. 

Second, notice that for Case (iii), $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ is a function of $\varepsilon$ and both $V_1$ and $V_2$ as we are operating 
at rates near the corner point of the capacity region. Both constraints in the capacity region in (7.2) are 
active. We provide an intuitive reasoning for the result in (7.21). Let $\mathcal{G}_j$ denote the event that message 
$j = 1, 2$ is decoded correctly. The error probability criterion in (7.8) can be rewritten as 

$$\Pr(\mathcal{G}_1 \cap \mathcal{G}_2) \ge 1 - \varepsilon. \tag{7.22}$$

Assuming independence of the events $\mathcal{G}_1$ and $\mathcal{G}_2$, which is generally not true in an IC because of interfering 
signals, 

$$\Pr(\mathcal{G}_1) \Pr(\mathcal{G}_2) \ge 1 - \varepsilon. \tag{7.23}$$

Given that the number of messages for codebook $j$ satisfies 

$$M_{jn} = \big\lfloor \exp\big( n R_j^* + \sqrt{n} L_j + o(\sqrt{n}) \big) \big\rfloor, \tag{7.24}$$
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Figure 7.3: Illustration of the region £(e; i?*, i?^) in Case (iii) with £ = 10 The regions are to the bottom 
left of the boundaries indicated. 


the optimum probability of correct detection satisfies (cf. Theorem 4.4) 

$$\Pr(\mathcal{G}_j) = \Phi\Big( -\frac{L_j}{\sqrt{V_j}} \Big) + o(1), \tag{7.25}$$

which then (heuristically) justifies (7.21). The proof makes the steps from (7.22)-(7.25) rigorous. Since $V_1$ 
and $V_2$ are the dispersions of the Gaussian channels without interference, this is the second-order analogue 
of Carleial's result for Gaussian ICs in the VSI regime [22] because the dispersions are not affected. Note 
that no cross dispersion terms are present in (7.21) unlike the Slepian-Wolf problem, where the correlation 
of two different entropy densities appears in the characterization of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ for corner points $(R_1^*, R_2^*)$. 
Finally, it is somewhat surprising that in the converse, even though we must ensure that the codewords 
$X_1^n$ and $X_2^n$ are independent, we do not need to leverage the wringing technique invented by Ahlswede [3], 
which was used to prove that the discrete memoryless MAC admits a strong converse. This is thanks to 
Gaussianity, which allows us to show that the first- and second-order statistics of a certain set of information 
densities in (7.29)-(7.30) are independent of $\mathbf{x}_1$ and $\mathbf{x}_2$ belonging to their respective power spheres. 


7.3 Proof Sketch of the Main Result 


The proof of Theorem 7.1 is somewhat long and tedious so we only sketch the key steps and refer the reader 
to [103] for the detailed calculations. 


Proof. We begin with the converse. We may assume, using the same argument as that for the point-to-point 
AWGN channel (cf. the Yaglom map trick [2^ Ch. 9, Thm. 6] in the proof of Theorem 4.4) that 
all the codewords $\mathbf{x}_j(m_j)$ satisfy $\|\mathbf{x}_j(m_j)\|_2^2 = n S_j$, $j = 1, 2$. Choose the auxiliary output distributions in 
Proposition 7.2 to be the $n$-fold products of 

$$Q_{Y_1|X_2}(y_1 | x_2) := \mathcal{N}(y_1;\, g_{12} x_2,\, g_{11}^2 S_1 + 1), \quad\text{and} \tag{7.26}$$
$$Q_{Y_2|X_1}(y_2 | x_1) := \mathcal{N}(y_2;\, g_{21} x_1,\, g_{22}^2 S_2 + 1). \tag{7.27}$$

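These are the output distributions induced if the input distributions $P_{X_1^n}$ and $P_{X_2^n}$ are $n$-fold products of 
$\mathcal{N}(0, S_1)$ and $\mathcal{N}(0, S_2)$ respectively. Fix any achievable $(\varepsilon, R_1^*, R_2^*)$-second-order coding rate pair $(L_1, L_2)$, 
i.e., $(L_1, L_2) \in \mathcal{L}(\varepsilon; R_1^*, R_2^*)$. Then, for every $\xi > 0$, the associated sequence of $(n, M_{1n}, M_{2n}, S_1, S_2, \varepsilon_n)$-codes satisfies 

$$\log M_{jn} \ge n R_j^* + \sqrt{n}\,(L_j - \xi), \quad j = 1, 2, \tag{7.28}$$

for $n$ large enough. To keep our notation succinct, define the information densities 

$$j_1(\mathbf{x}_1, \mathbf{x}_2, Y_1^n) := \log \frac{W_1^n(Y_1^n | \mathbf{x}_1, \mathbf{x}_2)}{Q_{Y_1^n|X_2^n}(Y_1^n | \mathbf{x}_2)} = \sum_{i=1}^n \log \frac{W_1(Y_{1i} | x_{1i}, x_{2i})}{Q_{Y_1|X_2}(Y_{1i} | x_{2i})}, \tag{7.29}$$

$$j_2(\mathbf{x}_1, \mathbf{x}_2, Y_2^n) := \log \frac{W_2^n(Y_2^n | \mathbf{x}_1, \mathbf{x}_2)}{Q_{Y_2^n|X_1^n}(Y_2^n | \mathbf{x}_1)} = \sum_{i=1}^n \log \frac{W_2(Y_{2i} | x_{1i}, x_{2i})}{Q_{Y_2|X_1}(Y_{2i} | x_{1i})}. \tag{7.30}$$

Let $C_j := C(\mathrm{snr}_j)$ for $j = 1, 2$. For any pair of vectors $(\mathbf{x}_1, \mathbf{x}_2)$ satisfying $\|\mathbf{x}_j\|_2^2 = n S_j$, 

$$\mathbb{E}\begin{bmatrix} j_1(\mathbf{x}_1, \mathbf{x}_2, Y_1^n) \\ j_2(\mathbf{x}_1, \mathbf{x}_2, Y_2^n) \end{bmatrix} = n \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}, \quad\text{and} \tag{7.31}$$

$$\mathrm{Cov}\begin{bmatrix} j_1(\mathbf{x}_1, \mathbf{x}_2, Y_1^n) \\ j_2(\mathbf{x}_1, \mathbf{x}_2, Y_2^n) \end{bmatrix} = n \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix}. \tag{7.32}$$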

Importantly, notice that the covariance matrix in (7.32) is diagonal. This is due to the independence of the 
noises $Z_{1i}$ and $Z_{2i}$ and is the crux of the converse proof for the corner point case in (7.21). 
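The moment identities (7.31)-(7.32) can be sanity-checked by simulation. The sketch below is an illustration under stated assumptions: it works in nats, fixes two arbitrary codewords on the power spheres, and uses the fact that, under the output distributions (7.26)-(7.27), $j_1$ depends on the channel output only through $g_{11} x_{1i} + Z_{1i}$ (the interference term $g_{12} x_{2i}$ cancels), and similarly for $j_2$. Function names, codewords and the number of trials are choices made here, not taken from the text.

```python
import math
import random


def info_density_pair(x1, x2, g11, g22, rng):
    """One realization of (j1, j2) in (7.29)-(7.30) for fixed codewords x1, x2."""
    n = len(x1)
    snr1 = g11 ** 2 * sum(a * a for a in x1) / n
    snr2 = g22 ** 2 * sum(b * b for b in x2) / n
    j1 = j2 = 0.0
    for a, b in zip(x1, x2):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # independent noises
        # log of N(y; mean, 1) over N(y; interference only, 1 + snr), per letter
        j1 += 0.5 * math.log1p(snr1) - 0.5 * z1 * z1 + (g11 * a + z1) ** 2 / (2 * (1 + snr1))
        j2 += 0.5 * math.log1p(snr2) - 0.5 * z2 * z2 + (g22 * b + z2) ** 2 / (2 * (1 + snr2))
    return j1, j2


def empirical_moments(x1, x2, g11, g22, trials, seed=0):
    """Sample means, variances and covariance of (j1, j2) over independent noise draws."""
    rng = random.Random(seed)
    samples = [info_density_pair(x1, x2, g11, g22, rng) for _ in range(trials)]
    m1 = sum(s[0] for s in samples) / trials
    m2 = sum(s[1] for s in samples) / trials
    v1 = sum((s[0] - m1) ** 2 for s in samples) / (trials - 1)
    v2 = sum((s[1] - m2) ** 2 for s in samples) / (trials - 1)
    c12 = sum((s[0] - m1) * (s[1] - m2) for s in samples) / (trials - 1)
    return (m1, m2), (v1, v2, c12)
```

With independent noises the empirical cross-covariance is close to zero, in line with the diagonal matrix in (7.32); making the two noise draws correlated in this sketch would produce a non-zero off-diagonal entry.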

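Now let $\gamma := n^{-3/4}$ in the probability in the non-asymptotic converse bound in (7.16). We denote this 
probability as $p$. By the law of total probability, the complementary probability $1 - p$ can be written as 

$$1 - p = \iint \Pr\left( \begin{bmatrix} j_1(\mathbf{x}_1, \mathbf{x}_2, Y_1^n) \\ j_2(\mathbf{x}_1, \mathbf{x}_2, Y_2^n) \end{bmatrix} > \begin{bmatrix} \log M_{1n} - n^{1/4} \\ \log M_{2n} - n^{1/4} \end{bmatrix} \right) \mathrm{d}P_{X_1^n}(\mathbf{x}_1)\, \mathrm{d}P_{X_2^n}(\mathbf{x}_2). \tag{7.33}$$

By (7.28), for large enough $n$, the inner probability can be bounded as 

$$\begin{aligned}
\Pr\left( \begin{bmatrix} j_1(\mathbf{x}_1, \mathbf{x}_2, Y_1^n) \\ j_2(\mathbf{x}_1, \mathbf{x}_2, Y_2^n) \end{bmatrix} > \begin{bmatrix} \log M_{1n} - n^{1/4} \\ \log M_{2n} - n^{1/4} \end{bmatrix} \right)
&\le \Pr\left( \begin{bmatrix} j_1(\mathbf{x}_1, \mathbf{x}_2, Y_1^n) \\ j_2(\mathbf{x}_1, \mathbf{x}_2, Y_2^n) \end{bmatrix} > \begin{bmatrix} n R_1^* + \sqrt{n}(L_1 - 2\xi) \\ n R_2^* + \sqrt{n}(L_2 - 2\xi) \end{bmatrix} \right) && (7.34) \\
&\le \Psi\left( \begin{bmatrix} \sqrt{n}(C_1 - R_1^*) - L_1 + 2\xi \\ \sqrt{n}(C_2 - R_2^*) - L_2 + 2\xi \end{bmatrix};\, \mathbf{0},\, \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix} \right) + \frac{\kappa}{\sqrt{n}} && (7.35) \\
&= \Phi\left( \frac{\sqrt{n}(C_1 - R_1^*) - L_1 + 2\xi}{\sqrt{V_1}} \right) \Phi\left( \frac{\sqrt{n}(C_2 - R_2^*) - L_2 + 2\xi}{\sqrt{V_2}} \right) + \frac{\kappa}{\sqrt{n}}, && (7.36)
\end{aligned}$$

where (7.35) is an application of the multivariate Berry-Esseen theorem and $\kappa$ is a finite 
constant. Note that $\Psi$ denotes the bivariate generalization of the Gaussian cdf, defined in (6.11). 
Equality (7.36) holds because the covariance matrix in (7.35) is diagonal by the calculation in (7.32). Since the 
bound in (7.36) does not depend on $\mathbf{x}_1, \mathbf{x}_2$ as long as $\|\mathbf{x}_j\|_2^2 = n S_j$, we have 

$$1 - p \le \prod_{j=1}^2 \Phi\left( \frac{\sqrt{n}(C_j - R_j^*) - L_j + 2\xi}{\sqrt{V_j}} \right) + \frac{\kappa}{\sqrt{n}}. \tag{7.37}$$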

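In Case (i), $R_1^* = C_1$ and $R_2^* < C_2$, so the term corresponding to $j = 2$ in the above product converges to 
one and we have 

$$1 - p \le \Phi\left( \frac{-L_1 + 2\xi}{\sqrt{V_1}} \right) + \delta_n, \tag{7.38}$$

where $\delta_n \to 0$ as $n \to \infty$. Thus Proposition 7.2 yields 

$$\varepsilon_n + \delta_n' \ge \Phi\left( \frac{L_1 - 2\xi}{\sqrt{V_1}} \right), \tag{7.39}$$

where $\delta_n' \to 0$ collects $\delta_n$ and the vanishing residual term of the converse bound (7.16). Taking $\limsup$ on both sides yields 

$$\limsup_{n \to \infty} \varepsilon_n \ge \Phi\left( \frac{L_1 - 2\xi}{\sqrt{V_1}} \right). \tag{7.40}$$

Since $\limsup_{n \to \infty} \varepsilon_n \le \varepsilon$, we can write 

$$L_1 \le \sqrt{V_1}\, \Phi^{-1}(\varepsilon) + 2\xi. \tag{7.41}$$

Since $\xi > 0$ is arbitrarily small, we may take $\xi \downarrow 0$ to complete the proof of the converse part for Case (i). For 
Case (ii), swap the indices 1 and 2 in the above calculation. For Case (iii), the analysis until (7.36) applies. 
However, now $R_j^* = C_j$ for both $j = 1, 2$, so both $\Phi(\cdot)$ functions in (7.36) are numbers strictly between 0 and 
1. Consequently, we have 

$$1 - p \le \Phi\left( \frac{-L_1 + 2\xi}{\sqrt{V_1}} \right) \Phi\left( \frac{-L_2 + 2\xi}{\sqrt{V_2}} \right) + \frac{\kappa}{\sqrt{n}}. \tag{7.42}$$

The rest of the arguments are similar to those for Case (i). 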

For the direct part, similarly to the single-user case in (4.88), we choose the input distributions 

$$P_{X_j^n}(\mathbf{x}_j) = \frac{\delta\{ \|\mathbf{x}_j\|_2 - \sqrt{n S_j} \}}{A_n(\sqrt{n S_j})}, \quad j = 1, 2, \tag{7.43}$$

where $\delta\{\cdot\}$ is the Dirac $\delta$-function and $A_n(r)$ is the area of a sphere in $\mathbb{R}^n$ with radius $r$. Clearly, the power 
constraints are satisfied with probability one. We choose the conditional output distributions $Q_{Y_1^n|X_2^n}$ and 
$Q_{Y_2^n|X_1^n}$ as in (7.26) and (7.27) and the output distributions $Q_{Y_1^n}$ and $Q_{Y_2^n}$ to be the $n$-fold products of 

$$Q_{Y_1}(y_1) := \mathcal{N}(y_1;\, 0,\, g_{11}^2 S_1 + g_{12}^2 S_2 + 1), \quad\text{and} \tag{7.44}$$
$$Q_{Y_2}(y_2) := \mathcal{N}(y_2;\, 0,\, g_{21}^2 S_1 + g_{22}^2 S_2 + 1). \tag{7.45}$$

With these choices of auxiliary output distributions, one can show the following technical lemma concerning 
the ratios of the induced (conditional) output distributions and the chosen (conditional) output distributions 
in Proposition 7.1. This is the multi-terminal analogue of (4.108) for the point-to-point AWGN channel and 
it allows us to replace the inconvenient induced output distributions $P_{X_1^n} W_1^n$ and $P_{X_1^n} P_{X_2^n} W_1^n$ (which are 
present in standard Feinstein-type achievability bounds, for example [66]) with the convenient $Q_{Y_1^n|X_2^n}$ and 
$Q_{Y_1^n}$ without too much degradation in error probability. 


Lemma 7.1. Let $Q_{Y_1^n}$, $Q_{Y_2^n}$, $Q_{Y_1^n|X_2^n}$ and $Q_{Y_2^n|X_1^n}$ be defined as the $n$-fold products of those in (7.44), (7.45), 
(7.26) and (7.27) respectively. Then the ratios $\zeta_{jk}$ in (7.14)-(7.15) 
are uniformly bounded by a finite constant as $n$ grows. Hence, their sum $\sum_{j,k=1}^2 \zeta_{jk}$ is also uniformly bounded. 

The proof of this lemma can be found in [103]. 

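Because $X_1^n$ and $X_2^n$ are uniform on their respective power spheres, it is not straightforward to analyze 
the behavior of the random vector 

$$\mathbf{B} := \begin{bmatrix} B_{11} \\ B_{21} \\ B_{12} \\ B_{22} \end{bmatrix} := \begin{bmatrix} \log \dfrac{W_1^n(Y_1^n | X_1^n, X_2^n)}{Q_{Y_1^n|X_2^n}(Y_1^n | X_2^n)} \\[2ex] \log \dfrac{W_2^n(Y_2^n | X_1^n, X_2^n)}{Q_{Y_2^n|X_1^n}(Y_2^n | X_1^n)} \\[2ex] \log \dfrac{W_1^n(Y_1^n | X_1^n, X_2^n)}{Q_{Y_1^n}(Y_1^n)} \\[2ex] \log \dfrac{W_2^n(Y_2^n | X_1^n, X_2^n)}{Q_{Y_2^n}(Y_2^n)} \end{bmatrix} \tag{7.46}$$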
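which is present in (7.13). Note that $\mathbf{B}$ can be written as a sum of dependent random variables due to 
the product structure of the chosen output distributions. To analyze the probabilistic behavior of $\mathbf{B}$ for 
large $n$, we leverage a technique by MolavianJazi and Laneman [112]. The basic ideas are as follows: Let 
$T_j^n \sim \mathcal{N}(\mathbf{0}_n, \mathbf{I}_{n \times n})$ for $j = 1, 2$ be standard Gaussian random vectors that are independent of each other and 
of the noises $Z_j^n$. Note that the input distributions in (7.43) allow us to write $X_{ji}$ as 

$$X_{ji} = \sqrt{n S_j}\, \frac{T_{ji}}{\|T_j^n\|_2}, \quad i = 1, \ldots, n. \tag{7.47}$$

Indeed, $\|X_j^n\|_2^2 = n S_j$ with probability one from the random code construction and (7.47). Now consider 
the length-10 random vector $\mathbf{U}_i := (U_{11i}, U_{21i}, U_{31i}, U_{41i}, U_{12i}, U_{22i}, U_{32i}, U_{42i}, U_{9i}, U_{10i})$, where 

$$U_{11i} := 1 - Z_{1i}^2, \qquad U_{21i} := g_{11} \sqrt{S_1}\, T_{1i} Z_{1i}, \tag{7.48}$$
$$U_{31i} := g_{12} \sqrt{S_2}\, T_{2i} Z_{1i}, \qquad U_{41i} := g_{11} g_{12} \sqrt{S_1 S_2}\, T_{1i} T_{2i}, \tag{7.49}$$
$$U_{12i} := 1 - Z_{2i}^2, \qquad U_{22i} := g_{22} \sqrt{S_2}\, T_{2i} Z_{2i}, \tag{7.50}$$
$$U_{32i} := g_{21} \sqrt{S_1}\, T_{1i} Z_{2i}, \qquad U_{42i} := g_{21} g_{22} \sqrt{S_1 S_2}\, T_{1i} T_{2i}, \tag{7.51}$$
$$U_{9i} := T_{1i}^2 - 1, \qquad U_{10i} := T_{2i}^2 - 1. \tag{7.52}$$

Clearly, $\mathbf{U}_i$ is i.i.d. across channel uses. Furthermore, $\mathbb{E}[\mathbf{U}_i] = \mathbf{0}$ and $\mathbb{E}[\|\mathbf{U}_i\|_2^3]$ is finite. The covariance 
matrix of $\mathbf{U}_i$ can also be computed. Define the functions $\tau_{11}, \tau_{12} : \mathbb{R}^{10} \to \mathbb{R}$ as 

$$\tau_{11}(\mathbf{u}) := \mathrm{snr}_1\, u_{11} + \frac{2 u_{21}}{\sqrt{1 + u_9}}, \quad\text{and} \tag{7.53}$$

$$\tau_{12}(\mathbf{u}) := (\mathrm{snr}_1 + \mathrm{inr}_1)\, u_{11} + \frac{2 u_{21}}{\sqrt{1 + u_9}} + \frac{2 u_{31}}{\sqrt{1 + u_{10}}} + \frac{2 u_{41}}{\sqrt{1 + u_9} \sqrt{1 + u_{10}}} \tag{7.54}$$

for user 1, and analogously for user 2. Then, through some algebra, one sees that $B_{11}$ and $B_{12}$ can be written 
as 

$$B_{11} = n C(\mathrm{snr}_1) + \frac{n}{2(1 + \mathrm{snr}_1)}\, \tau_{11}\Big( \frac{1}{n} \sum_{i=1}^n \mathbf{U}_i \Big), \quad\text{and} \tag{7.55}$$

$$B_{12} = n C(\mathrm{snr}_1 + \mathrm{inr}_1) + \frac{n}{2(1 + \mathrm{snr}_1 + \mathrm{inr}_1)}\, \tau_{12}\Big( \frac{1}{n} \sum_{i=1}^n \mathbf{U}_i \Big). \tag{7.56}$$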
(7.56) 


The other random variables in the B vector can be expressed similarly. 
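The "lifting" identity for $B_{11}$ can be verified numerically. The sketch below is an illustration, not the authors' code: it draws a codeword uniformly on the power sphere by normalizing a Gaussian vector as in (7.47), evaluates $B_{11}$ directly as a sum of per-letter log-likelihood ratios, and compares it with the expression in (7.55) built from the empirical means of $U_{11i}$, $U_{21i}$ and $U_{9i}$. It assumes natural logarithms, and all function and variable names are hypothetical.

```python
import math
import random


def b11_direct_and_lifted(n, g11, s1, rng):
    """Return (B11 computed directly, B11 via (7.55), empirical power of X_1^n)."""
    t1 = [rng.gauss(0.0, 1.0) for _ in range(n)]   # auxiliary Gaussian vector T_1^n
    z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]   # noise Z_1^n at receiver 1
    norm_t = math.sqrt(sum(t * t for t in t1))
    x1 = [math.sqrt(n * s1) * t / norm_t for t in t1]  # uniform on the sphere, (7.47)
    snr1 = g11 ** 2 * s1

    # Direct evaluation; the interference g12 * x2 cancels because the chosen
    # conditional output distribution in (7.26) is centred at g12 * x2.
    direct = sum(
        0.5 * math.log1p(snr1) - 0.5 * z * z + (g11 * x + z) ** 2 / (2 * (1 + snr1))
        for x, z in zip(x1, z1)
    )

    # Lifted evaluation: empirical means of the i.i.d. coordinates in (7.48), (7.52).
    u11 = sum(1.0 - z * z for z in z1) / n
    u21 = sum(g11 * math.sqrt(s1) * t * z for t, z in zip(t1, z1)) / n
    u9 = sum(t * t - 1.0 for t in t1) / n
    tau11 = snr1 * u11 + 2.0 * u21 / math.sqrt(1.0 + u9)           # (7.53)
    lifted = n * 0.5 * math.log1p(snr1) + n * tau11 / (2 * (1 + snr1))  # (7.55)

    power = sum(x * x for x in x1) / n
    return direct, lifted, power
```

The two evaluations agree up to floating-point error for every realization, which is the point of the technique: the dependence among the coordinates of $X_1^n$ is absorbed into a smooth function of an average of i.i.d. vectors.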

From (7.55)-(7.56), we are able to see the essence of the MolavianJazi-Laneman [112] technique. The 
information densities $B_{jk}$, $j, k = 1, 2$, were initially difficult to analyze because the input random vectors $X_j^n$ 
in (7.43) are uniform on power spheres. This choice of input distributions results in codewords $X_j^n$ whose 
coordinates are dependent so standard limit theorems do not readily apply. By defining higher-dimensional 
random vectors $\mathbf{U}_i$ and appropriate functions $\tau_{jk}$, one then sees that $\mathbf{B}$ can be expressed as a function of a 
sum of i.i.d. random vectors. Now, one may consider a Taylor expansion of the differentiable functions $\tau_{jk}$ 
around the mean $\mathbf{0}$ to approximate $\mathbf{B}$ with a sum of i.i.d. random vectors. Through this analysis, one can 
rigorously show that 

$$\frac{1}{\sqrt{n}} \left( \mathbf{B} - n \begin{bmatrix} C(\mathrm{snr}_1) \\ C(\mathrm{snr}_2) \\ C(\mathrm{snr}_1 + \mathrm{inr}_1) \\ C(\mathrm{snr}_2 + \mathrm{inr}_2) \end{bmatrix} \right) \xrightarrow{\mathrm{d}} \mathcal{N}\left( \mathbf{0},\, \begin{bmatrix} V_1 & 0 & * & * \\ 0 & V_2 & * & * \\ * & * & * & * \\ * & * & * & * \end{bmatrix} \right), \tag{7.57}$$

where the entries marked as $*$ are finite and inconsequential for the subsequent analyses. Recall also that 
$V_j = V(\mathrm{snr}_j)$ for $j = 1, 2$. In fact, the rate of convergence to Gaussianity in (7.57) can be quantified by 
means of Theorem 1.5. 


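With these preparations, we are ready to evaluate the probability in the direct bound in (7.13), which 
we denote as $p$. We consider all three cases in tandem. Fix $(L_1, L_2) \in \mathcal{L}(\varepsilon; R_1^*, R_2^*)$. Let the number of 
codewords in the codebook be 

$$M_{jn} = \big\lfloor \exp\big( n R_j^* + \sqrt{n} L_j - 2 n^{1/4} \big) \big\rfloor \tag{7.58}$$

for $j = 1, 2$. It is clear that 

$$\liminf_{n \to \infty} \frac{1}{\sqrt{n}} \big( \log M_{jn} - n R_j^* \big) \ge L_j. \tag{7.59}$$

Also let $\gamma := n^{-3/4}$. With these choices, the complementary probability $1 - p$ can be expressed as 

$$1 - p = \Pr\left( \mathbf{B} > \begin{bmatrix} n R_1^* + \sqrt{n} L_1 - n^{1/4} \\ n R_2^* + \sqrt{n} L_2 - n^{1/4} \\ n(R_1^* + R_2^*) + \sqrt{n}(L_1 + L_2) - 3 n^{1/4} \\ n(R_1^* + R_2^*) + \sqrt{n}(L_1 + L_2) - 3 n^{1/4} \end{bmatrix} \right). \tag{7.60}$$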


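Now by the SVSI assumption in (7.11)-(7.12), 

$$R_1^* + R_2^* \le C(\mathrm{snr}_1) + C(\mathrm{snr}_2) < \min\{ C(\mathrm{snr}_1 + \mathrm{inr}_1),\, C(\mathrm{snr}_2 + \mathrm{inr}_2) \}. \tag{7.61}$$

The convergence in (7.57) implies that 

$$\mathbb{E}[B_{12}] = n C(\mathrm{snr}_1 + \mathrm{inr}_1), \quad\text{and}\quad \mathbb{E}[B_{22}] = n C(\mathrm{snr}_2 + \mathrm{inr}_2). \tag{7.62}$$

Since the expectations of $B_{12}$ and $B_{22}$ are strictly larger than $n(R_1^* + R_2^*)$ (cf. (7.61)), by standard Chernoff 
bounding techniques, 

$$\Pr\big( B_{12} \le n(R_1^* + R_2^*) + \sqrt{n}(L_1 + L_2) - 3 n^{1/4} \big) \le \exp(-n\xi), \quad\text{and} \tag{7.63}$$
$$\Pr\big( B_{22} \le n(R_1^* + R_2^*) + \sqrt{n}(L_1 + L_2) - 3 n^{1/4} \big) \le \exp(-n\xi), \tag{7.64}$$

for some $\xi > 0$. Consequently, by the union bound, (7.60) reduces to 

$$1 - p \ge \Pr\left( \begin{bmatrix} B_{11} \\ B_{21} \end{bmatrix} > \begin{bmatrix} n R_1^* + \sqrt{n} L_1 - n^{1/4} \\ n R_2^* + \sqrt{n} L_2 - n^{1/4} \end{bmatrix} \right) - 2 \exp(-n\xi). \tag{7.65}$$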

Just as in the converse, one can then analyze this probability for the various cases using the convergence to 
Gaussianity in (7.57). This completes the proof of the direct part. □ 

Chapter 8 

A Special Class of Gaussian Multiple Access Channels 


The multiple access channel (MAC) is a communication model in which many parties would like to simultaneously 
send independent messages over a common medium to a sole destination. Together with the 
broadcast, interference and relay channels, the MAC is a fundamental building block of more complicated 
communication networks. For example, the MAC is an appropriate model for the uplink of cellular systems 
where multiple mobile phone users would like to communicate to a distant base station over a wireless 
medium. The capacity region of the MAC is, by now, well known and goes back to the work by Ahlswede [T] 
and Liao [105] in the early 1970s. The strong converse was established by Dueck [37] and Ahlswede [3]. 

A yet simpler model, which we consider in this chapter, is the asymmetric MAC (A-MAC) as shown in 
Fig. 8.1. This channel model, also known as the MAC with degraded message sets [49, Ex. 5.18(b)] or the 
cognitive [44] MAC, was first studied by Haroutunian, Prelov [128] and van der Meulen [167]. Here, 
encoder 1 has knowledge of both messages $m_1$ and $m_2$, while encoder 2 only has its own message $m_2$. For 
the Gaussian case, the channel law is $Y = X_1 + X_2 + Z$, where $Z$ is standard Gaussian noise. The capacity 
region [49, Ex. 5.18(b)] is the set of all $(R_1, R_2)$ satisfying 

$$R_1 \le C\big( (1 - \rho^2) S_1 \big), \quad\text{and}\quad R_1 + R_2 \le C\big( S_1 + S_2 + 2\rho \sqrt{S_1 S_2} \big) \tag{8.1}$$

for some $\rho \in [0, 1]$ where $S_1$ and $S_2$ are the admissible transmit powers. Rate pairs in (8.1) are achieved 
using superposition coding [3T]. This region for $S_1 = S_2 = 1$ is shown in Fig. 8.2. Observe that $\rho \in [0, 1]$ 
parametrizes points on the boundary. Each point on the curved part of the boundary is achieved by a unique 
bivariate Gaussian distribution. 
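The parametrization of the boundary by $\rho$ in (8.1) can be illustrated with a few lines of code. This is a minimal sketch in nats; the function names are illustrative assumptions, and the "corner point" computed here is simply the rate pair at which both inequalities in (8.1) hold with equality for the given $\rho$.

```python
import math


def capacity(snr: float) -> float:
    """Gaussian capacity C(snr) = 0.5 * log(1 + snr), in nats."""
    return 0.5 * math.log1p(snr)


def amac_rate_bounds(rho: float, s1: float, s2: float):
    """Right-hand sides of (8.1): bounds on R1 and on R1 + R2 for a given rho."""
    bound_r1 = capacity((1.0 - rho ** 2) * s1)
    bound_sum = capacity(s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return bound_r1, bound_sum


def amac_corner_point(rho: float, s1: float, s2: float):
    """Rate pair (R1, R2) meeting both constraints of (8.1) with equality."""
    bound_r1, bound_sum = amac_rate_bounds(rho, s1, s2)
    return bound_r1, bound_sum - bound_r1
```

Sweeping `rho` from 0 to 1 with `s1 = s2 = 1` traces the curved part of the boundary: at `rho = 0` user 1 gets its full single-user rate, while at `rho = 1` encoder 1 devotes all its power to coherently helping user 2, so the bound on $R_1$ is zero and the sum rate is largest.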

In this chapter, we show that the assumptions concerning Gaussianity and asymmetry of the message 
sets (i.e., partial cooperation) allow us to determine the second-order asymptotics of this model. The main 
result here is of a somewhat different flavor compared to results in previous chapters on multi-terminal 
information theory problems because the second-order rate region $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ is characterized not only in 
terms of covariances of vectors of information densities or dispersions. Indeed, we will see that there is a 
subtle interaction between the derivatives of the first-order capacity terms in (8.1) with respect to $\rho$, and 
the dispersions in the description of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$. The fact that the derivatives appear in the answer to an 
information-theoretic question appears to be novel.^1 The difference in the characterization of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ 
compared to second-order regions in previous chapters is because, with the union over $\rho \in [0, 1]$, the boundary 
of the capacity region in (8.1) is curved in contrast to the polygonal capacity regions in previous chapters. We 
will see that the curvature of the boundary results in the second-order region $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ being a half-space 
in $\mathbb{R}^2$. This half-space is characterized by a slope and intercept, both of which are expressible in terms of 
the dispersions, together with the derivatives of the capacities. 

^1 In fact, the dispersion of the compound channel [no] is a function of the dispersions of the constituent channels and the 
derivatives of the capacity terms. 

Figure 8.1: Illustration of the asymmetric MAC or A-MAC 


Intuitively, the extra derivative term arises because we need to account for all possible angles of approach 
to a boundary point $(R_1^*, R_2^*)$. Using a sequence of input distributions parametrized by a single correlation 
parameter $\rho$ not depending on the blocklength turns out to be suboptimal in the second-order sense, as we 
can only achieve the angles of approach within the specific trapezoid parametrized by $\rho$ (see Fig. 8.2 and its 
caption). Thus, our coding strategy is to let the sequence of input distributions vary with the blocklength. 
In particular, they are parametrized by a sequence $\{\rho_n\}_{n \in \mathbb{N}}$ that converges to $\rho$ with speed $O(1/\sqrt{n})$. A Taylor 
expansion of the first-order capacity vector then yields the derivative term. 
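The effect of letting $\rho_n = \rho + \beta/\sqrt{n}$ can be illustrated numerically. Differentiating the two capacity terms in (8.1) with respect to $\rho$ (in nats) gives $-S_1\rho/(1 + S_1(1-\rho^2))$ and $\sqrt{S_1 S_2}/(1 + S_1 + S_2 + 2\rho\sqrt{S_1 S_2})$. The sketch below, with illustrative names and an arbitrary constant $\beta$ chosen only for the example, checks these derivatives and the first-order Taylor approximation.

```python
import math


def rate_vector(rho: float, s1: float, s2: float):
    """First-order capacity terms of (8.1) as functions of rho (nats)."""
    i1 = 0.5 * math.log1p(s1 * (1.0 - rho ** 2))
    i12 = 0.5 * math.log1p(s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return i1, i12


def rate_vector_derivative(rho: float, s1: float, s2: float):
    """Entry-wise derivative of rate_vector with respect to rho."""
    d1 = -s1 * rho / (1.0 + s1 * (1.0 - rho ** 2))
    d12 = math.sqrt(s1 * s2) / (1.0 + s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return d1, d12


def taylor_gap(rho: float, beta: float, n: int, s1: float, s2: float):
    """Error of the approximation I(rho + beta/sqrt(n)) ~ I(rho) + (beta/sqrt(n)) * D(rho)."""
    step = beta / math.sqrt(n)
    exact = rate_vector(rho + step, s1, s2)
    base = rate_vector(rho, s1, s2)
    deriv = rate_vector_derivative(rho, s1, s2)
    return tuple(e - (b + step * d) for e, b, d in zip(exact, base, deriv))
```

The perturbation shifts the first-order terms by an amount of order $1/\sqrt{n}$, the same order as the second-order rate backoff, while the approximation error is of order $1/n$ and therefore does not affect the second-order analysis.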

Similarly to the Gaussian IC with SVSI, the achievability proof uses the coding on spheres strategy in 
which pairs of codewords are drawn uniformly at random from high-dimensional spheres. However, because 
the underlying coding strategy involves superposition coding, the analysis is more subtle. In particular, we 
are required to bound the ratios of certain induced output densities and product output densities. The proof 
of the converse part involves several new ideas including (i) reduction to almost constant correlation type 
subcodes; (ii) evaluation of a global outer bound and (iii) specialization of the global outer bound to obtain 
local second-order asymptotic results. 

The material in this chapter is based on work by Scarlett and Tan [138]. 


8.1 Definitions and Non-Asymptotic Bounds 

The model we consider is as follows: 

$$Y_i = X_{1i} + X_{2i} + Z_i \tag{8.2}$$

where $i = 1, \ldots, n$ and $Z_i \sim \mathcal{N}(0, 1)$ is white Gaussian noise. The channel gains are set to unity without loss 
of generality. Thus, the channel transition law is 

$$W(y | x_1, x_2) = \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} (y - x_1 - x_2)^2 \Big). \tag{8.3}$$

The channel operates in a stationary and memoryless manner. 

We define an $(n, M_1, M_2, S_1, S_2, \varepsilon)$-code for the Gaussian A-MAC which includes two encoders 
$f_1 : \{1, \ldots, M_1\} \times \{1, \ldots, M_2\} \to \mathbb{R}^n$, $f_2 : \{1, \ldots, M_2\} \to \mathbb{R}^n$ and a decoder $\varphi : \mathbb{R}^n \to \{1, \ldots, M_1\} \times \{1, \ldots, M_2\}$ 
such that the following power constraints are satisfied 

$$\| f_1(m_1, m_2) \|_2^2 = \sum_{i=1}^n f_{1i}(m_1, m_2)^2 \le n S_1, \quad\text{and} \tag{8.4}$$

$$\| f_2(m_2) \|_2^2 = \sum_{i=1}^n f_{2i}(m_2)^2 \le n S_2, \tag{8.5}$$

and the average error probability satisfies 

$$\frac{1}{M_1 M_2} \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2} W^n\big( \mathbb{R}^n \setminus \mathcal{D}_{m_1,m_2} \,\big|\, f_1(m_1, m_2), f_2(m_2) \big) \le \varepsilon. \tag{8.6}$$

As with the Gaussian IC discussed in the previous chapter, $\mathcal{D}_{m_1,m_2}$ denotes the decoding region for messages 
$(m_1, m_2)$ and $S_j$ represents the admissible power for the $j$-th user. 

Figure 8.2: Capacity region of a Gaussian A-MAC where $S_1 = S_2 = 1$ (horizontal axis: $R_1$ in nats/use). The three cases of Theorem 8.1 are 
illustrated. Each $\rho \in (0, 1]$ corresponds to a trapezoid of rate pairs achievable by a unique input distribution 
$\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}(\rho))$. However, coding with a fixed input distribution is insufficient to achieve all angles of approach 
to a boundary point as there are regions within the capacity region not in the trapezoid parametrized by $\rho$. For a fixed $\rho$, one 
can approach the corresponding corner point in the direction indicated by the vector $\mathbf{v}$ using the fixed input distribution 
$\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}(\rho))$, but the same is not true of the direction indicated by $\mathbf{v}'$, since the approach is from outside the 
trapezoid. 

The following non-asymptotic bounds are easily derived. They are analogues of the bounds by Feinstein [53] 
and Verdú-Han [169] (or Hayashi-Nagaoka [77]). See Boucheron-Salamatian [20] for the proofs of 
similar results. 


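Proposition 8.1 (Achievability bound for the A-MAC). Fix any input joint distribution $P_{X_1^n X_2^n}$ whose 
support satisfies the power constraints in (8.4)-(8.5), i.e., $\|X_j^n\|_2^2 \le n S_j$ with probability one. For every 
$n \in \mathbb{N}$, every $\gamma > 0$, any choice of output distributions $Q_{Y^n|X_2^n}$ and $Q_{Y^n}$, and any two sets $\mathcal{A}_1 \subset \mathcal{X}_2^n \times \mathcal{Y}^n$ 
and $\mathcal{A}_{12} \subset \mathcal{Y}^n$, there exists an $(n, M_1, M_2, S_1, S_2, \varepsilon)$-code for the A-MAC such that 

$$\begin{aligned}
\varepsilon \le \Pr\bigg( & \log \frac{W^n(Y^n | X_1^n, X_2^n)}{Q_{Y^n|X_2^n}(Y^n | X_2^n)} \le \log M_1 + n\gamma \ \text{ or } \ \log \frac{W^n(Y^n | X_1^n, X_2^n)}{Q_{Y^n}(Y^n)} \le \log(M_1 M_2) + n\gamma \bigg) \\
& + \Pr\big( (X_2^n, Y^n) \notin \mathcal{A}_1 \big) + \Pr\big( Y^n \notin \mathcal{A}_{12} \big) + \zeta \exp(-n\gamma),
\end{aligned} \tag{8.7}$$

where $\zeta = \zeta_1 + \zeta_{12}$ and 

$$\zeta_1 := \sup_{(\mathbf{x}_2, \mathbf{y}) \in \mathcal{A}_1} \frac{P_{X_1^n|X_2^n} W^n(\mathbf{y} | \mathbf{x}_2)}{Q_{Y^n|X_2^n}(\mathbf{y} | \mathbf{x}_2)}, \qquad \zeta_{12} := \sup_{\mathbf{y} \in \mathcal{A}_{12}} \frac{P_{X_1^n X_2^n} W^n(\mathbf{y})}{Q_{Y^n}(\mathbf{y})}. \tag{8.8}$$

Again notice that our freedom to choose $Q_{Y^n|X_2^n}$ and $Q_{Y^n}$ results in the need to control $\zeta_1$ and $\zeta_{12}$, 
which are the maximum values of the ratios of the densities induced by the code with respect to the chosen 
output densities. The maximum values are restricted to those typical values of $(\mathbf{x}_2, \mathbf{y})$ and $\mathbf{y}$ indicated by 
the chosen sets $\mathcal{A}_1$ and $\mathcal{A}_{12}$. 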

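Proposition 8.2 (Converse bound for the A-MAC). For every $n \in \mathbb{N}$, every $\gamma > 0$ and for any choice of 
output distributions $Q_{Y^n|X_2^n}$ and $Q_{Y^n}$, every $(n, M_1, M_2, S_1, S_2, \varepsilon)$-code for the A-MAC must satisfy 

$$\varepsilon \ge \Pr\bigg( \log \frac{W^n(Y^n | X_1^n, X_2^n)}{Q_{Y^n|X_2^n}(Y^n | X_2^n)} \le \log M_1 - n\gamma \ \text{ or } \ \log \frac{W^n(Y^n | X_1^n, X_2^n)}{Q_{Y^n}(Y^n)} \le \log(M_1 M_2) - n\gamma \bigg) - 2\exp(-n\gamma), \tag{8.9}$$

for some input joint distribution $P_{X_1^n X_2^n}$ whose support satisfies the power constraints in (8.4)-(8.5). 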


8.2 Second-Order Asymptotics 


As in the previous chapters on multi-terminal problems, given a point on the boundary of the capacity region 
$(R_1^*, R_2^*)$, we are interested in characterizing the set of all $(L_1, L_2)$ pairs for which there exists a sequence of 
$(n, M_{1n}, M_{2n}, S_1, S_2, \varepsilon_n)$-codes such that 

$$\liminf_{n \to \infty} \frac{1}{\sqrt{n}} \big( \log M_{jn} - n R_j^* \big) \ge L_j, \ j = 1, 2, \quad\text{and}\quad \limsup_{n \to \infty} \varepsilon_n \le \varepsilon. \tag{8.10}$$

We denote this set as $\mathcal{L}(\varepsilon; R_1^*, R_2^*) \subset \mathbb{R}^2$. 

8.2.1 Preliminary Definitions 


Before we can state the main results, we need to define a few more fundamental quantities. For a pair of 
rates (Ri,R 2 ), the rate vector is 


R := 


Ri 

R\ + i?2 


( 8 . 11 ) 


The input distribution to achieve a point on the boundary characterized by some p S [0,1] is a 2-dimensional 
Gaussian distribution with zero mean and covariance matrix 


S(p) 




pV ^2 

S 2 


( 8 . 12 ) 


The corresponding mutual information vector is given by 

I(P) = 

Let 

\l{x,y) := log^e- 


\Il{p)] 


[ C(5i(l-p2)) ] 

/l2(p) 


[c{Si+S2 + 2p^^)\ 


x{y + 2) 


(8.13) 


(8.14) 


2 (x -I- 1)(2/ -I- 1) 

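be the Gaussian cross-dispersion function and note that $\mathsf{V}(x) := \mathsf{V}(x, x)$ is the Gaussian dispersion function
defined previously in (4.79). For fixed $0 \le \rho \le 1$, define the information-dispersion matrix

$$ \mathbf{V}(\rho) := \begin{bmatrix} V_1(\rho) & V_{1,12}(\rho) \\ V_{1,12}(\rho) & V_{12}(\rho) \end{bmatrix}, \tag{8.15} $$

where the elements of the matrix are

$$ V_1(\rho) := \mathsf{V}\big(S_1(1-\rho^2)\big), \tag{8.16} $$
$$ V_{1,12}(\rho) := \mathsf{V}\big(S_1(1-\rho^2),\, S_1 + S_2 + 2\rho\sqrt{S_1S_2}\big), \tag{8.17} $$
$$ V_{12}(\rho) := \mathsf{V}\big(S_1 + S_2 + 2\rho\sqrt{S_1S_2}\big). \tag{8.18} $$

Let $(X_1, X_2) \sim P_{X_1X_2} = \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}(\rho))$ and define $Q_{Y|X_2}$ and $Q_Y$ to be the Gaussian distributions induced by
$P_{X_1X_2}$ and $W$, namely

$$ Q_{Y|X_2}(y|x_2) := \mathcal{N}\big(y;\, x_2(1 + \rho\sqrt{S_1/S_2}),\, 1 + S_1(1-\rho^2)\big), \quad\text{and} \tag{8.19} $$
$$ Q_Y(y) := \mathcal{N}\big(y;\, 0,\, 1 + S_1 + S_2 + 2\rho\sqrt{S_1S_2}\big). \tag{8.20} $$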
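The quantities in (8.13) and (8.15)-(8.18) are easy to evaluate numerically. The following Python sketch does so under two assumptions that are ours: logarithms are natural (so that $\log^2 e = 1$), and $\mathsf{C}(x) = \frac{1}{2}\log(1+x)$ is the usual Gaussian capacity function. The function names are illustrative and are not part of the monograph.

```python
import math


def capacity(snr):
    """Gaussian capacity function C(x) = (1/2) log(1 + x), in nats (assumed convention)."""
    return 0.5 * math.log1p(snr)


def cross_dispersion(x, y):
    """Gaussian cross-dispersion V(x, y) of (8.14) in nats^2, where log^2(e) = 1."""
    return x * (y + 2.0) / (2.0 * (x + 1.0) * (y + 1.0))


def effective_snrs(s1, s2, rho):
    """Arguments of C and V appearing in (8.13) and (8.16)-(8.18)."""
    a = s1 * (1.0 - rho ** 2)
    b = s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2)
    return a, b


def mutual_information_vector(s1, s2, rho):
    """I(rho) = [I_1(rho), I_12(rho)] as in (8.13)."""
    a, b = effective_snrs(s1, s2, rho)
    return (capacity(a), capacity(b))


def dispersion_matrix(s1, s2, rho):
    """V(rho) as in (8.15), returned as a 2x2 tuple of tuples."""
    a, b = effective_snrs(s1, s2, rho)
    v1 = cross_dispersion(a, a)
    v1_12 = cross_dispersion(a, b)
    v12 = cross_dispersion(b, b)
    return ((v1, v1_12), (v1_12, v12))


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, mutual_information_vector(1.0, 1.0, rho), dispersion_matrix(1.0, 1.0, rho))
```

For $\rho = 1$ the first row and column of the returned matrix vanish, which reflects the singularity of $\mathbf{V}(1)$ that matters in Case (iii) of the main result.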

It should be noted that the random variables $(X_1, X_2)$ and the densities $Q_{Y|X_2}$ and $Q_Y$ all depend on $\rho$;
this dependence is suppressed throughout the chapter. The mutual information vector $\mathbf{I}(\rho)$ and information-
dispersion matrix $\mathbf{V}(\rho)$ are the mean vector and conditional covariance matrix of the information density
vector

$$ \mathbf{j}(x_1,x_2,y) := \begin{bmatrix} j_1(x_1,x_2,y) \\ j_{12}(x_1,x_2,y) \end{bmatrix} = \begin{bmatrix} \log\dfrac{W(y|x_1,x_2)}{Q_{Y|X_2}(y|x_2)} \\ \log\dfrac{W(y|x_1,x_2)}{Q_Y(y)} \end{bmatrix}. \tag{8.21} $$

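That is, we can write $\mathbf{I}(\rho)$ and $\mathbf{V}(\rho)$ as

$$ \mathbf{I}(\rho) = \mathbb{E}\big[\mathbf{j}(X_1,X_2,Y)\big], \quad\text{and} \tag{8.22} $$
$$ \mathbf{V}(\rho) = \mathbb{E}\big[\mathrm{Cov}\big(\mathbf{j}(X_1,X_2,Y)\,\big|\,X_1,X_2\big)\big], \tag{8.23} $$

with $(X_1,X_2,Y) \sim P_{X_1X_2} \times W$. We also need a generalization of the $\Phi^{-1}(\cdot)$ function. Define the “inverse
image” of $\Psi(z_1,z_2;\mathbf{0},\boldsymbol{\Sigma})$ as

$$ \Psi^{-1}(\boldsymbol{\Sigma},\varepsilon) := \big\{ (z_1,z_2) \in \mathbb{R}^2 : \Psi(-z_1,-z_2;\mathbf{0},\boldsymbol{\Sigma}) \ge 1-\varepsilon \big\}. \tag{8.24} $$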
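Membership in the set (8.24) can be tested numerically. The sketch below is one possible way to do this with the Python standard library only: it evaluates the bivariate Gaussian cdf $\Psi$ by conditioning on the first coordinate and integrating with Simpson's rule. The integration scheme, the truncation at ten standard deviations and all function names are our assumptions, and the routine requires a non-singular covariance matrix.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def bivariate_normal_cdf(a, b, cov, steps=2000):
    """Approximate Psi(a, b; 0, cov) = Pr(Z1 <= a, Z2 <= b) for a non-singular 2x2 cov.

    Uses Pr(Z1 <= a, Z2 <= b) = int_{-inf}^{a} f_{Z1}(z) Pr(Z2 <= b | Z1 = z) dz,
    truncated at -10 standard deviations and integrated by Simpson's rule.
    """
    (v11, v12), (_, v22) = cov
    sd1 = math.sqrt(v11)
    cond_sd = math.sqrt(v22 - v12 * v12 / v11)  # std. dev. of Z2 given Z1
    lower = -10.0 * sd1
    if a <= lower:
        return 0.0
    h = (a - lower) / steps  # steps must be even for Simpson's rule

    def integrand(z):
        density = math.exp(-0.5 * (z / sd1) ** 2) / (sd1 * math.sqrt(2.0 * math.pi))
        return density * _STD_NORMAL.cdf((b - (v12 / v11) * z) / cond_sd)

    total = integrand(lower) + integrand(a)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(lower + i * h)
    return total * h / 3.0


def in_psi_inverse(z1, z2, cov, eps):
    """Test whether (z1, z2) belongs to Psi^{-1}(cov, eps) as defined in (8.24)."""
    return bivariate_normal_cdf(-z1, -z2, cov) >= 1.0 - eps


if __name__ == "__main__":
    identity = ((1.0, 0.0), (0.0, 1.0))
    print(in_psi_inverse(-3.0, -3.0, identity, 0.1))  # far in the third quadrant
    print(in_psi_inverse(0.0, 0.0, identity, 0.1))    # the origin
```

With the identity covariance and $\varepsilon = 0.1$, a point deep in the third quadrant passes the test while the origin does not.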

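An illustration of this set is provided in Fig. 8.3. Observe that for $\varepsilon < \frac{1}{2}$, the set lies entirely within the
third quadrant of the plane. This represents “backoffs” from the first-order fundamental limits.

Figure 8.3: Illustration of the set $\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$, where $\mathbf{V}(\rho)$ is defined in (8.15), for $\rho = 0.5$ and $\rho = 0.995$ (horizontal axis: $z_1$). The regions are to the
bottom left of the boundaries indicated.

8.2.2 Global Second-Order Asymptotics

Here we provide inner and outer bounds on $\mathcal{C}(n,\varepsilon)$, defined to be the set of $(R_1,R_2)$ pairs such that there
exist codebooks of length $n$ and rates at least $R_1$ and $R_2$ yielding an average error probability not exceeding
$\varepsilon$. Let $\underline{g}(\rho,\varepsilon,n)$ and $\overline{g}(\rho,\varepsilon,n)$ be arbitrary functions of $\rho$, $\varepsilon$ and $n$ for now, and define the inner and outer
regions

$$ \underline{\mathcal{R}}(n,\varepsilon;\rho) := \Big\{ (R_1,R_2) : \mathbf{R} \in \mathbf{I}(\rho) + \frac{\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)}{\sqrt{n}} + \underline{g}(\rho,\varepsilon,n)\mathbf{1} \Big\}, \tag{8.25} $$
$$ \overline{\mathcal{R}}(n,\varepsilon;\rho) := \Big\{ (R_1,R_2) : \mathbf{R} \in \mathbf{I}(\rho) + \frac{\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)}{\sqrt{n}} + \overline{g}(\rho,\varepsilon,n)\mathbf{1} \Big\}. \tag{8.26} $$

Lemma 8.1 (Global Bounds on the $(n,\varepsilon)$-Capacity Region). There exist functions $\underline{g}(\rho,\varepsilon,n)$ and $\overline{g}(\rho,\varepsilon,n)$
such that

$$ \bigcup_{0\le\rho\le 1} \underline{\mathcal{R}}(n,\varepsilon;\rho) \subset \mathcal{C}(n,\varepsilon) \subset \bigcup_{-1\le\rho\le 1} \overline{\mathcal{R}}(n,\varepsilon;\rho), \tag{8.27} $$

and $\underline{g}$ and $\overline{g}$ satisfy the following properties:

(i) For any sequence $\{\rho_n\}_{n\in\mathbb{N}}$ with $\rho_n \to \rho \in (-1,1)$, we have

$$ \underline{g}(\rho_n,\varepsilon,n) = O\Big(\frac{\log n}{n}\Big), \quad\text{and}\quad \overline{g}(\rho_n,\varepsilon,n) = O\Big(\frac{\log n}{n}\Big). \tag{8.28} $$

(ii) Else, for any sequence $\{\rho_n\}_{n\in\mathbb{N}}$ with $\rho_n \to \pm 1$, we have

$$ \underline{g}(\rho_n,\varepsilon,n) = o\Big(\frac{1}{\sqrt{n}}\Big), \quad\text{and}\quad \overline{g}(\rho_n,\varepsilon,n) = o\Big(\frac{1}{\sqrt{n}}\Big). \tag{8.29} $$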


Lemma 8.1 serves as a stepping stone to establish the local behavior of first-order optimal codes near a
boundary point. A proof sketch of the lemma is provided in Section 8.3.1.

We remark that even though the union for the outer bound in (8.27) is taken over $\rho \in [-1,1]$, only the
values $\rho \in [0,1]$ will play a role in establishing the local asymptotics in Section 8.2.3, since negative values
of $\rho$ are not even first-order optimal, i.e., they fail to achieve a point on the boundary of the capacity region.

We do not claim that the remainder terms in (8.28)-(8.29) are uniform in the limiting value $\rho$ of $\{\rho_n\}_{n\in\mathbb{N}}$;
such uniformity will not be required in establishing our main local result below. On the other hand, it is
crucial that values of $\rho$ varying with $n$ are handled.


8.2.3 Local Second-Order Asymptotics 

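To characterize $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$, we need yet another definition, which is a feature we have not encountered thus
far in this monograph. Define

$$ \mathbf{D}(\rho) = \begin{bmatrix} D_1(\rho) \\ D_{12}(\rho) \end{bmatrix} := \frac{\partial}{\partial\rho} \begin{bmatrix} I_1(\rho) \\ I_{12}(\rho) \end{bmatrix} \tag{8.30} $$

to be the derivative of the mutual information vector with respect to $\rho$, where the individual derivatives are
given by

$$ \frac{\partial I_1(\rho)}{\partial\rho} = \frac{-S_1\rho}{1 + S_1(1-\rho^2)}, \quad\text{and} \tag{8.31} $$
$$ \frac{\partial I_{12}(\rho)}{\partial\rho} = \frac{\sqrt{S_1S_2}}{1 + S_1 + S_2 + 2\rho\sqrt{S_1S_2}}. \tag{8.32} $$

Note that $\rho \in (0,1]$ represents the strictly concave part of the boundary (the part of the boundary where
$R_2 > 0.2$ in Fig. 8.2), and in this interval we have $D_1(\rho) < 0$ and $D_{12}(\rho) > 0$.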
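The closed forms (8.31)-(8.32) can be sanity-checked against a central finite difference of the mutual information vector. The following Python sketch does this, assuming natural logarithms and $\mathsf{C}(x) = \frac{1}{2}\log(1+x)$; the step size and the function names are illustrative choices of ours.

```python
import math


def mutual_information_vector(s1, s2, rho):
    """I(rho) = [C(S1(1 - rho^2)), C(S1 + S2 + 2 rho sqrt(S1 S2))], in nats."""
    i1 = 0.5 * math.log1p(s1 * (1.0 - rho ** 2))
    i12 = 0.5 * math.log1p(s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return (i1, i12)


def derivative_vector(s1, s2, rho):
    """D(rho) = [D_1(rho), D_12(rho)] from the closed forms (8.31)-(8.32)."""
    d1 = -s1 * rho / (1.0 + s1 * (1.0 - rho ** 2))
    d12 = math.sqrt(s1 * s2) / (1.0 + s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return (d1, d12)


def finite_difference(s1, s2, rho, h=1e-6):
    """Central-difference approximation of dI/drho, used only as a numerical check."""
    upper = mutual_information_vector(s1, s2, rho + h)
    lower = mutual_information_vector(s1, s2, rho - h)
    return tuple((u - l) / (2.0 * h) for u, l in zip(upper, lower))


if __name__ == "__main__":
    print(derivative_vector(1.0, 1.0, 0.5))
    print(finite_difference(1.0, 1.0, 0.5))
```

For $S_1 = S_2 = 1$ and $\rho = \frac{1}{2}$, both routines agree on $D_1 = -\frac{2}{7}$ and $D_{12} = \frac{1}{4}$, with the signs stated above.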

Furthermore, for a vector $\mathbf{v} = (v_1,v_2) \in \mathbb{R}^2$, we define the down-set of $\mathbf{v}$ as

$$ \mathbf{v}^- := \big\{ (w_1,w_2) \in \mathbb{R}^2 : w_1 \le v_1,\, w_2 \le v_2 \big\}. \tag{8.33} $$

We are now in a position to state our main result, whose proof is sketched in Section 8.3.2.


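Theorem 8.1 (Local Second-Order Rates). Depending on $(R_1^*, R_2^*)$ (see Fig. 8.2), we have the following
three cases:

Case (i): $R_1^* = I_1(0)$ and $R_1^* + R_2^* \le I_{12}(0)$ (vertical segment of the boundary corresponding to $\rho = 0$),

$$ \mathcal{L}(\varepsilon; R_1^*, R_2^*) = \Big\{ (L_1,L_2) : L_1 \le \sqrt{V_1(0)}\,\Phi^{-1}(\varepsilon) \Big\}. \tag{8.34} $$

Case (ii): $R_1^* = I_1(\rho)$ and $R_1^* + R_2^* = I_{12}(\rho)$ (curved segment of the boundary corresponding to $0 < \rho < 1$),

$$ \mathcal{L}(\varepsilon; R_1^*, R_2^*) = \bigg\{ (L_1,L_2) : \begin{bmatrix} L_1 \\ L_1 + L_2 \end{bmatrix} \in \bigcup_{\beta\in\mathbb{R}} \Big\{ \beta\mathbf{D}(\rho) + \Psi^{-1}(\mathbf{V}(\rho),\varepsilon) \Big\} \bigg\}. \tag{8.35} $$

Case (iii): $R_1^* = 0$ and $R_1^* + R_2^* = I_{12}(1)$ (point on the vertical axis corresponding to $\rho = 1$),

$$ \mathcal{L}(\varepsilon; R_1^*, R_2^*) = \Bigg\{ (L_1,L_2) : \begin{bmatrix} L_1 \\ L_1 + L_2 \end{bmatrix} \in \bigcup_{\beta\le 0} \bigg\{ \beta\mathbf{D}(1) + \begin{bmatrix} 0 \\ \sqrt{V_{12}(1)}\,\Phi^{-1}(\varepsilon) \end{bmatrix}^- \bigg\} \Bigg\}. \tag{8.36} $$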
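As a small numerical illustration of Case (i), the threshold $\sqrt{V_1(0)}\,\Phi^{-1}(\varepsilon)$ in (8.34) can be computed directly, since $V_1(0) = \mathsf{V}(S_1)$ by (8.16). The Python sketch below assumes natural logarithms; the function names are ours.

```python
import math
from statistics import NormalDist


def gaussian_dispersion(snr):
    """Gaussian dispersion V(x) = V(x, x) = x (x + 2) / (2 (x + 1)^2), in nats^2."""
    return snr * (snr + 2.0) / (2.0 * (snr + 1.0) ** 2)


def case_i_threshold(s1, eps):
    """Largest L_1 permitted by (8.34): sqrt(V_1(0)) * Phi^{-1}(eps), with V_1(0) = V(S_1)."""
    return math.sqrt(gaussian_dispersion(s1)) * NormalDist().inv_cdf(eps)


if __name__ == "__main__":
    # For eps < 1/2 the threshold is negative, i.e., a backoff from I_1(0).
    print(case_i_threshold(1.0, 0.1))
```

The threshold depends on $S_1$ and $\varepsilon$ only, in line with the absence of $L_2$ from (8.34).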


Figure 8.4: Illustration of the set $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ in (8.35) with $S_1 = S_2 = 1$, $\rho = \frac{1}{2}$ and $\varepsilon = 0.1$ (plot title: Second-Order Region). The
set corresponding to $\beta = 0$ in (8.35) is denoted as $\mathbf{G}\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$. Regions are to the bottom-left of the
boundaries.

See Fig. 8.4 for an illustration of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ in (8.35) and the set of $(L_1,L_2)$ such that $(L_1, L_1 + L_2)$
belongs to $\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$, i.e., $\mathbf{G}\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$, where $\mathbf{G} = [1, 0; -1, 1]$ is the invertible matrix that transforms
the coordinate system from $[L_1, L_1 + L_2]^T$ to $[L_1, L_2]^T$. In other words, $\mathbf{G}\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$ is the same set as
that in (8.35) neglecting the union and setting $\beta = 0$. It can be seen that $\mathbf{G}\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$ is a strict subset
of $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$. In fact, $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ is a half-space in $\mathbb{R}^2$ for any $(R_1^*, R_2^*)$ on the boundary of the capacity
region corresponding to $\rho < 1$. So $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$ in (8.35) can be alternatively written as

$$ \mathcal{L}(\varepsilon; R_1^*, R_2^*) = \big\{ (L_1,L_2) : L_2 \le a_\rho L_1 + b_{\rho,\varepsilon} \big\} \tag{8.37} $$

where the slope and intercept are respectively defined as

$$ a_\rho := \frac{D_{12}(\rho) - D_1(\rho)}{D_1(\rho)}, \tag{8.38} $$
$$ b_{\rho,\varepsilon} := \inf\big\{ b \in \mathbb{R} : \exists\, L_1 \in \mathbb{R} \text{ s.t. } (L_1, (a_\rho + 1)L_1 + b) \in \Psi^{-1}(\mathbf{V}(\rho),\varepsilon) \big\}. \tag{8.39} $$
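The slope (8.38) has a simple geometric reading: it is the slope, in $(L_1, L_2)$ coordinates, of the direction $\mathbf{G}\mathbf{D}(\rho)$ along which the set $\mathbf{G}\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$ is swept by the union over $\beta$. The Python sketch below checks this identity numerically for $0 < \rho < 1$ (so that $D_1(\rho) \ne 0$), assuming natural logarithms; the function names are ours.

```python
import math


def derivative_vector(s1, s2, rho):
    """D(rho) = [D_1(rho), D_12(rho)] from (8.31)-(8.32)."""
    d1 = -s1 * rho / (1.0 + s1 * (1.0 - rho ** 2))
    d12 = math.sqrt(s1 * s2) / (1.0 + s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return (d1, d12)


def slope(s1, s2, rho):
    """Slope a_rho of (8.38); requires rho > 0 so that D_1(rho) is non-zero."""
    d1, d12 = derivative_vector(s1, s2, rho)
    return (d12 - d1) / d1


def to_rate_coordinates(v):
    """Apply G = [1, 0; -1, 1], mapping [L_1, L_1 + L_2] to [L_1, L_2]."""
    return (v[0], v[1] - v[0])


if __name__ == "__main__":
    direction = to_rate_coordinates(derivative_vector(1.0, 1.0, 0.5))
    print(slope(1.0, 1.0, 0.5), direction[1] / direction[0])
```

Since $D_1(\rho) < 0 < D_{12}(\rho)$ on the curved segment, the slope is negative there.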


8.2.4 Discussion of the Main Result 

Observe that in Case (i), the second-order region is simply characterized by a scalar dispersion term $V_1(0)$
and the inverse of the Gaussian cdf. In this part of the boundary, there is effectively only a single rate
constraint in terms of $R_1$, since we are operating “far away” from the sum rate constraint. This results in a
large deviations-type event for the sum rate constraint which has no bearing on second-order asymptotics.
This is similar to observations made in earlier chapters.

Cases (ii)-(iii) are more interesting, and their proofs do not follow from standard techniques. As in Case
(iii) for Theorem 6.1, the second-order asymptotics for Case (ii) depend on the dispersion matrix $\mathbf{V}(\rho)$ and
the bivariate Gaussian cdf, since both rate constraints are active at a point on the boundary parametrized
by $\rho \in (0,1)$. However, the expression containing $\Psi^{-1}$ alone (i.e., the expression obtained by setting $\beta = 0$
in (8.35)) corresponds to only considering the unique input distribution $\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}(\rho))$ achieving the point
$(R_1^*, R_2^*) = (I_1(\rho), I_{12}(\rho) - I_1(\rho))$. From Fig. 8.2, this is not sufficient to achieve all second-order coding
rates, since there are non-empty regions within the capacity region that are not contained in the trapezoid
of rate pairs achievable using a single Gaussian $\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}(\rho))$.

Thus, to achieve all $(L_1,L_2)$ pairs in $\mathcal{L}(\varepsilon; R_1^*, R_2^*)$, we must allow the sequence of input distributions to
vary with the blocklength $n$. This is manifested in the $\beta\mathbf{D}(\rho)$ term. Roughly speaking, our proof strategy
of the direct part involves random coding with a sequence of input distributions that are uniform on two
spheres with correlation coefficient $\rho_n = \rho + O(\frac{1}{\sqrt{n}})$ between them. By a Taylor expansion, the resulting
mutual information vector satisfies

$$ \mathbf{I}(\rho_n) \approx \mathbf{I}(\rho) + (\rho_n - \rho)\mathbf{D}(\rho). \tag{8.40} $$

Since $\rho_n - \rho = O(\frac{1}{\sqrt{n}})$, the gradient term $(\rho_n - \rho)\mathbf{D}(\rho)$ also contributes to the second-order behavior, together
with the traditional Gaussian approximation term $\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$.
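The role of the gradient term can be seen numerically: with $\rho_n = \rho + \beta/\sqrt{n}$, the scaled gap $\sqrt{n}\,(\mathbf{I}(\rho_n) - \mathbf{I}(\rho))$ approaches $\beta\mathbf{D}(\rho)$ as $n$ grows. The Python sketch below illustrates this under the assumption of natural logarithms; the parameter values and function names are ours.

```python
import math


def mutual_information_vector(s1, s2, rho):
    """I(rho) = [C(S1(1 - rho^2)), C(S1 + S2 + 2 rho sqrt(S1 S2))], in nats."""
    i1 = 0.5 * math.log1p(s1 * (1.0 - rho ** 2))
    i12 = 0.5 * math.log1p(s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return (i1, i12)


def derivative_vector(s1, s2, rho):
    """D(rho) from (8.31)-(8.32)."""
    d1 = -s1 * rho / (1.0 + s1 * (1.0 - rho ** 2))
    d12 = math.sqrt(s1 * s2) / (1.0 + s1 + s2 + 2.0 * rho * math.sqrt(s1 * s2))
    return (d1, d12)


def scaled_gap(s1, s2, rho, beta, n):
    """sqrt(n) * (I(rho_n) - I(rho)) with rho_n = rho + beta / sqrt(n)."""
    rho_n = rho + beta / math.sqrt(n)
    shifted = mutual_information_vector(s1, s2, rho_n)
    base = mutual_information_vector(s1, s2, rho)
    return tuple(math.sqrt(n) * (a - b) for a, b in zip(shifted, base))


if __name__ == "__main__":
    target = derivative_vector(1.0, 1.0, 0.5)  # beta = 1, so the limit is D(rho)
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, scaled_gap(1.0, 1.0, 0.5, 1.0, n), target)
```

The residual decays like $1/\sqrt{n}$, which is why the quadratic Taylor remainder does not affect the second-order term.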

For the converse, we consider an arbitrary sequence of codes with rate pairs $\{(R_{1n}, R_{2n})\}_{n\in\mathbb{N}}$ converging
to $(R_1^*, R_2^*) = (I_1(\rho), I_{12}(\rho) - I_1(\rho))$ with second-order behavior given by (8.10). From the global result, we
know $[R_{1n}, R_{1n} + R_{2n}]^T \in \overline{\mathcal{R}}(n,\varepsilon;\rho_n)$ for some sequence $\{\rho_n\}_{n\in\mathbb{N}}$. We then establish, using the definition of
the second-order coding rates in (8.10), that $\rho_n = \rho + O(\frac{1}{\sqrt{n}})$. Finally, by the Bolzano-Weierstrass theorem,
we may pass to a subsequence of $\rho_n$ (if necessary), thus establishing the converse.

A similar discussion holds true for Case (iii); the main differences are that the covariance matrix is
singular, and that the union in (8.36) is taken over $\beta \le 0$ only, since $\rho_n$ can only approach one from below.


8.3 Proof Sketches of the Main Results 


8.3.1 Proof Sketch of the Global Bound (Lemma 8.1) 


Proof. Because the proof is rather lengthy, we only focus on the case where $\rho_n \to \rho \in (-1,1)$. The main
ideas are already present here. The case where $\rho_n \to \pm 1$ is omitted, and the reader is referred to [138] for
the details.

The converse proof is split into several steps for clarity. In the first three steps, we perform a series
of reductions to simplify the problem. We do so to simplify the evaluation of the probability in the non-asymptotic
converse bound in Proposition 8.2.

Step 1: (Reduction from Maximal to Equal Power Constraints) As usual, by the Yaglom map trick [55,
Ch. 9, Thm. 6], it suffices to consider codes such that the power constraints in (8.4)-(8.5) hold with equality. See the
argument for the proof of the converse for the asymptotic expansion of the AWGN channel (Theorem 4.4).

Step 2: (Reduction from Average to Maximal Error Probability) Using similar arguments to [119,
Sec. 3.4.4], it suffices to prove the converse for maximal (rather than average) error probability.† This
is shown by starting with an average-error code, and then constructing a maximal-error code as follows: (i)
Keep only the fraction $\frac{1}{\sqrt{n}}$ of user 2’s messages with the smallest error probabilities (averaged over user 1’s
message); (ii) For each of user 2’s messages, keep only the fraction $\frac{1}{\sqrt{n}}$ of user 1’s messages with the smallest
error probabilities.

Step 3: (Correlation Type Classes) Define $\mathcal{I}_0 := \{0\}$ and $\mathcal{I}_k := (\frac{k-1}{n}, \frac{k}{n}]$, $k = 1,\ldots,n$, and let $\mathcal{I}_{-k} := -\mathcal{I}_k$.
Consider the correlation type classes (or simply type classes)

$$ \mathcal{T}_n(k) := \bigg\{ (\mathbf{x}_1,\mathbf{x}_2) : \frac{\langle \mathbf{x}_1,\mathbf{x}_2\rangle}{\|\mathbf{x}_1\|_2\|\mathbf{x}_2\|_2} \in \mathcal{I}_k \bigg\} \tag{8.41} $$

where $k = -n,\ldots,n$. The total number of type classes is $2n + 1$, which is polynomial in $n$ analogously to
the finite alphabet case (cf. the type counting lemma). Using a similar argument to that for the asymmetric
broadcast channel in [39, Lem. 16.2], and the fact that we are considering the maximal error probability so
all message pairs $(m_1, m_2)$ have error probabilities not exceeding $\varepsilon$ (cf. Step 2), it suffices to consider codes
for which all pairs $(\mathbf{x}_1,\mathbf{x}_2)$ are in a single type class, say indexed by $k$. This results in a rate loss of $R_1$
and $R_2$ of only $O(\frac{\log n}{n})$. We define $\hat{\rho} := \frac{k}{n}$ according to the type class indexed by $k$ in (8.41).
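The assignment of a codeword pair to a correlation type class is a simple quantization of the empirical correlation coefficient. The following Python sketch, with function names of our choosing, returns the index $k$ of the class $\mathcal{T}_n(k)$ in (8.41) containing a given pair, together with the representative $\hat{\rho} = k/n$. Pairs whose correlation lies numerically on an interval boundary are sensitive to floating-point rounding, which the sketch does not attempt to handle.

```python
import math


def type_class_index(x1, x2):
    """Index k in {-n, ..., n} such that the empirical correlation of (x1, x2) lies in I_k.

    I_0 = {0}, I_k = ((k - 1)/n, k/n] for k >= 1, and I_{-k} = -I_k.
    """
    n = len(x1)
    inner = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    corr = max(-1.0, min(1.0, inner / (norm1 * norm2)))  # guard against rounding
    if corr == 0.0:
        return 0
    k = math.ceil(n * abs(corr))
    return k if corr > 0 else -k


def representative_correlation(x1, x2):
    """rho_hat = k / n for the type class containing (x1, x2)."""
    return type_class_index(x1, x2) / len(x1)


if __name__ == "__main__":
    print(type_class_index([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]))
    print(representative_correlation([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]))
```

By construction, $\hat{\rho}$ differs from the empirical correlation coefficient by less than $1/n$, which is the source of the $O(1/n)$ terms in the next step.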


Step 4: (Approximation of Empirical Moments with True Moments) The value of $\rho$ used in the single-letter
information densities in (8.21) is arbitrary, and is chosen to be $\hat{\rho}$.

Using the definition of $\mathcal{T}_n(k)$ and the information densities in (8.21), we can show that the first and
second moments of $\sum_{i=1}^n \mathbf{j}(x_{1i},x_{2i},Y_i)$ are approximately given by $\mathbf{I}(\hat{\rho})$ and $\mathbf{V}(\hat{\rho})$ respectively, i.e.,

$$ \bigg\| \mathbb{E}\bigg[ \frac{1}{n}\sum_{i=1}^n \mathbf{j}(x_{1i},x_{2i},Y_i) \bigg] - \mathbf{I}(\hat{\rho}) \bigg\|_\infty \le \frac{\xi_1}{n}, \quad\text{and} \tag{8.42} $$
$$ \bigg\| \mathrm{Cov}\bigg[ \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{j}(x_{1i},x_{2i},Y_i) \bigg] - \mathbf{V}(\hat{\rho}) \bigg\|_\infty \le \frac{\xi_2}{n}, \tag{8.43} $$

for some $\xi_1 > 0$ and $\xi_2 > 0$ not depending on $\hat{\rho}$. The expectation and covariance above are taken with
respect to $W^n(\cdot|\mathbf{x}_1,\mathbf{x}_2)$. Roughly speaking, the reason for (8.42) and (8.43) is that all pairs of vectors
in $\mathcal{T}_n(k)$ have approximately the same empirical correlation coefficient, so the expectation and covariance of
appropriately normalized information density vectors are also close to a representative mutual information
vector and dispersion matrix respectively.
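The first-moment bound (8.42) can be illustrated without simulation, because the expectation over the Gaussian noise is available in closed form. The Python sketch below assumes the channel $Y_i = x_{1i} + x_{2i} + Z_i$ with unit-variance noise (consistent with the variances in (8.19)-(8.20)), natural logarithms, and codewords that meet the power constraints with equality; the helper `make_pair` and all function names are ours.

```python
import math


def expected_information_density(x1, x2, s1, s2, rho_hat):
    """Exact mean of (1/n) sum_i j(x1[i], x2[i], Y_i) when Y_i = x1[i] + x2[i] + Z_i, Z_i ~ N(0, 1).

    The densities Q_{Y|X2} and Q_Y of (8.19)-(8.20) are evaluated at rho = rho_hat.
    E[-(Y - x1 - x2)^2 / 2] = -1/2, while E[(Y - m)^2] = (x1 + x2 - m)^2 + 1.
    """
    n = len(x1)
    c = rho_hat * math.sqrt(s1 / s2)
    tau2 = 1.0 + s1 * (1.0 - rho_hat ** 2)                      # variance of Q_{Y|X2}
    sigma2 = 1.0 + s1 + s2 + 2.0 * rho_hat * math.sqrt(s1 * s2)  # variance of Q_Y
    m1 = sum((a - c * b) ** 2 for a, b in zip(x1, x2)) / n
    m12 = sum((a + b) ** 2 for a, b in zip(x1, x2)) / n
    e1 = 0.5 * math.log(tau2) - 0.5 + (m1 + 1.0) / (2.0 * tau2)
    e12 = 0.5 * math.log(sigma2) - 0.5 + (m12 + 1.0) / (2.0 * sigma2)
    return (e1, e12)


def make_pair(n, s1, s2, corr):
    """Deterministic pair with squared norms n*S1 and n*S2 and empirical correlation corr."""
    x2 = [math.sqrt(n * s2)] + [0.0] * (n - 1)
    x1 = [math.sqrt(n * s1) * corr, math.sqrt(n * s1) * math.sqrt(1.0 - corr ** 2)]
    x1 += [0.0] * (n - 2)
    return x1, x2


if __name__ == "__main__":
    n, s1, s2, rho_hat = 100, 1.0, 1.0, 0.5
    x1, x2 = make_pair(n, s1, s2, rho_hat - 0.5 / n)  # correlation inside I_k for k = 50
    print(expected_information_density(x1, x2, s1, s2, rho_hat))
    print(0.5 * math.log(1.75), 0.5 * math.log(4.0))  # I_1(rho_hat), I_12(rho_hat)
```

When the empirical correlation equals $\hat{\rho}$ exactly, the mean coincides with $\mathbf{I}(\hat{\rho})$; a mismatch of at most $1/n$ in the correlation perturbs it by $O(1/n)$.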


Step 5: (Evaluation of the Non-Asymptotic Converse Bound in Proposition 8.2) Let $\mathbf{R}_n := [R_{1n}, R_{1n} + R_{2n}]^T$
(where $R_{jn} = \frac{1}{n}\log M_{jn}$) be the rate vector consisting of the non-asymptotic rates $(R_{1n}, R_{2n})$. Additionally,
let

$$ \mathcal{F} := \bigg\{ \log\frac{W^n(Y^n|X_1^n,X_2^n)}{Q_{Y^n|X_2^n}(Y^n|X_2^n)} \le \log M_{1n} - n\gamma \bigg\}, \tag{8.44} $$
$$ \mathcal{G} := \bigg\{ \log\frac{W^n(Y^n|X_1^n,X_2^n)}{Q_{Y^n}(Y^n)} \le \log(M_{1n}M_{2n}) - n\gamma \bigg\} \tag{8.45} $$

be the two “error” events within the probability in (8.9). We then have

$$ \Pr(\mathcal{F}\cup\mathcal{G}) = 1 - \Pr(\mathcal{F}^c\cap\mathcal{G}^c) \tag{8.46} $$
$$ = 1 - \mathbb{E}\big[\Pr(\mathcal{F}^c\cap\mathcal{G}^c \,|\, X_1^n, X_2^n)\big]. \tag{8.47} $$

† This argument is not valid for the standard MAC, but is possible here due to the partial cooperation (i.e., user 1 knowing
both messages). It is well known that the capacity regions for the MAC under the average and maximum error probability
criteria are different, an observation first made by Dueck.


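In particular, using the definition of $\mathbf{j}(x_1,x_2,y)$ in (8.21) and the fact that $Q_{Y^n|X_2^n}$ and $Q_{Y^n}$ are product
distributions, the conditional probability in (8.47) can be bounded as

$$ \Pr(\mathcal{F}^c\cap\mathcal{G}^c \,|\, X_1^n = \mathbf{x}_1, X_2^n = \mathbf{x}_2) = \Pr\bigg( \frac{1}{n}\sum_{i=1}^n \mathbf{j}(x_{1i},x_{2i},Y_i) > \mathbf{R}_n - \gamma\mathbf{1} \bigg) \tag{8.48} $$
$$ \le \Pr\bigg( \frac{1}{n}\sum_{i=1}^n \Big( \mathbf{j}(x_{1i},x_{2i},Y_i) - \mathbb{E}\big[\mathbf{j}(x_{1i},x_{2i},Y_i)\big] \Big) > \mathbf{R}_n - \mathbf{I}(\hat{\rho}) - \gamma\mathbf{1} - \frac{\xi_1}{n}\mathbf{1} \bigg), \tag{8.49} $$

where (8.49) follows from the approximation of the empirical expectation in (8.42). In the rest of this global
converse proof, $\gamma$ is set to $\frac{\log n}{2n}$ so $\exp(-n\gamma) = \frac{1}{\sqrt{n}}$ in the non-asymptotic converse bound in (8.9).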

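Applying the multivariate Berry-Esseen theorem (Corollary 1.1) to (8.49) yields

$$ \Pr(\mathcal{F}^c\cap\mathcal{G}^c \,|\, X_1^n = \mathbf{x}_1, X_2^n = \mathbf{x}_2) \le \Psi\Bigg( \sqrt{n}\Big(I_1(\hat{\rho}) + \gamma + \frac{\xi_1}{n} - R_{1n}\Big),\ \sqrt{n}\Big(I_{12}(\hat{\rho}) + \gamma + \frac{\xi_1}{n} - (R_{1n} + R_{2n})\Big);\ \mathbf{0},\ \mathrm{Cov}\bigg[\frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{j}(x_{1i},x_{2i},Y_i)\bigg] \Bigg) + \frac{\psi(\hat{\rho})}{\sqrt{n}}, \tag{8.50} $$

where $\psi(\hat{\rho})$ is a constant. By Taylor expanding the continuously differentiable function $(z_1,z_2,\mathbf{V}) \mapsto$
$\Psi(z_1,z_2;\mathbf{0},\mathbf{V})$, and using the approximation of the empirical covariance in (8.43) together with the fact
that $\det(\mathbf{V}(\hat{\rho})) > 0$ for $\hat{\rho} \in (-1,1)$, we obtain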


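$$ \Pr(\mathcal{F}^c\cap\mathcal{G}^c \,|\, X_1^n = \mathbf{x}_1, X_2^n = \mathbf{x}_2) \le \Psi\Big( \sqrt{n}\Big(I_1(\hat{\rho}) + \gamma + \frac{\xi_1}{n} - R_{1n}\Big),\ \sqrt{n}\Big(I_{12}(\hat{\rho}) + \gamma + \frac{\xi_1}{n} - (R_{1n} + R_{2n})\Big);\ \mathbf{0},\ \mathbf{V}(\hat{\rho}) \Big) + \frac{\eta(\hat{\rho})\log n}{\sqrt{n}}, \tag{8.51} $$

where $\eta(\hat{\rho})$ is a constant. It should be noted that $\psi(\hat{\rho}), \eta(\hat{\rho}) \to \infty$ as $\hat{\rho} \to \pm 1$, since $\mathbf{V}(\hat{\rho})$ becomes singular
as $\hat{\rho} \to \pm 1$. Despite this non-uniformity, we conclude from (8.9), (8.47) and (8.51) that any $(n,\varepsilon)$-code with
codewords $(\mathbf{x}_1,\mathbf{x}_2)$ all belonging to $\mathcal{T}_n(k)$ must have rates $(R_{1n}, R_{2n})$ that satisfy

$$ \begin{bmatrix} R_{1n} \\ R_{1n} + R_{2n} \end{bmatrix} \in \mathbf{I}(\hat{\rho}) + \frac{1}{\sqrt{n}}\Psi^{-1}\bigg( \mathbf{V}(\hat{\rho}),\ \varepsilon + \frac{2}{\sqrt{n}} + \frac{\eta(\hat{\rho})\log n}{\sqrt{n}} \bigg) + \Big(\gamma + \frac{\xi_1}{n}\Big)\mathbf{1}. \tag{8.52} $$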


We immediately obtain the global converse bound on the $(n,\varepsilon)$-capacity region (outer bound in (8.27) of
Lemma 8.1) by employing the approximation

$$ \Psi^{-1}\bigg( \mathbf{V}(\hat{\rho}),\ \varepsilon + \frac{c\log n}{\sqrt{n}} \bigg) \subset \Psi^{-1}\big(\mathbf{V}(\hat{\rho}),\varepsilon\big) + h\big(\mathbf{V}(\hat{\rho}),\varepsilon,c\big)\frac{\log n}{\sqrt{n}}\mathbf{1}, \tag{8.53} $$

where $c > 0$ is an arbitrary finite constant and $h(\mathbf{V}(\hat{\rho}),\varepsilon,c)$ is finite for $\hat{\rho} \ne \pm 1$. The details of the approximation
in (8.53) are omitted, and can be found in [138].

We now provide a proof sketch of the achievability part of Lemma 8.1 (inner bound in (8.27)). At a high
level, we will adopt the strategy of drawing random codewords on appropriate power spheres, similar to the
coding strategy for AWGN channels (Section 4.3) and for the Gaussian IC with SVSI studied in an earlier chapter. We then
analyze the ensemble behavior of this random code.

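Step 1: (Random-Coding Ensemble) Let $\rho \in [0,1]$ be a fixed correlation coefficient. The ensemble will
be defined in such a way that, with probability one, each codeword pair falls into the set

$$ \mathcal{D}_n(\rho) := \Big\{ (\mathbf{x}_1,\mathbf{x}_2) : \|\mathbf{x}_1\|_2^2 = nS_1,\ \|\mathbf{x}_2\|_2^2 = nS_2,\ \langle \mathbf{x}_1,\mathbf{x}_2\rangle = n\rho\sqrt{S_1S_2} \Big\}. \tag{8.54} $$

This means that the power constraints in (8.4)-(8.5) are satisfied with equality and the empirical correlation
between each codeword pair is also exactly $\rho$. We use superposition coding, in which the codewords are
generated according to

$$ \Big\{ \big( X_2^n(m_2), \{X_1^n(m_1,m_2)\}_{m_1=1}^{M_1} \big) \Big\}_{m_2=1}^{M_2} \sim \prod_{m_2=1}^{M_2} \bigg( P_{X_2^n}(\mathbf{x}_2(m_2)) \prod_{m_1=1}^{M_1} P_{X_1^n|X_2^n}\big(\mathbf{x}_1(m_1,m_2)\,\big|\,\mathbf{x}_2(m_2)\big) \bigg) \tag{8.55} $$

for codeword distributions $P_{X_2^n}$ and $P_{X_1^n|X_2^n}$. We choose the codeword distributions to be

$$ P_{X_2^n}(\mathbf{x}_2) \propto \delta\{\|\mathbf{x}_2\|_2^2 = nS_2\}, \quad\text{and} \tag{8.56} $$
$$ P_{X_1^n|X_2^n}(\mathbf{x}_1|\mathbf{x}_2) \propto \delta\{\|\mathbf{x}_1\|_2^2 = nS_1,\ \langle \mathbf{x}_1,\mathbf{x}_2\rangle = n\rho\sqrt{S_1S_2}\}, \tag{8.57} $$

where $\delta\{\cdot\}$ is the Dirac $\delta$-function, and $P_{X^n}(\mathbf{x}) \propto \delta\{\mathbf{x} \in \mathcal{A}\}$ means that $P_{X^n}(\mathbf{x}) = \frac{\delta\{\mathbf{x}\in\mathcal{A}\}}{c}$, with the
normalization constant $c > 0$ chosen such that $\int P_{X^n}(\mathbf{x})\,\mathrm{d}\mathbf{x} = 1$. In other words, each $X_2^n(m_2)$ is drawn
uniformly from an $(n-1)$-sphere with radius $\sqrt{nS_2}$ and for each $m_2$, each $X_1^n(m_1,m_2)$ is drawn uniformly from
the set of all $\mathbf{x}_1$ satisfying the power and correlation coefficient constraints with equality. These distributions
clearly ensure that the codeword pairs belong to $\mathcal{D}_n(\rho)$ with probability one.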
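One way to realize the ensemble (8.56)-(8.57) in practice is sketched below in Python. It relies on the standard fact that a normalized i.i.d. Gaussian vector is uniform on the sphere, and on the decomposition of $\mathbf{x}_1$ into a component along $\mathbf{x}_2$ and a component, uniform on a sphere, in the orthogonal complement of $\mathbf{x}_2$. This construction is our illustration rather than a procedure spelled out in the text; it needs $n \ge 2$, and the function name is hypothetical.

```python
import math
import random


def sample_codeword_pair(n, s1, s2, rho, rng):
    """Draw (x1, x2) with ||x1||^2 = n*S1, ||x2||^2 = n*S2 and <x1, x2> = n*rho*sqrt(S1*S2).

    x2 is uniform on the sphere of radius sqrt(n*S2); given x2, x1 is uniform on the
    set of vectors meeting the power and correlation constraints with equality.
    """
    # x2: normalize an i.i.d. Gaussian vector, then scale to the power sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g_norm = math.sqrt(sum(v * v for v in g))
    x2 = [math.sqrt(n * s2) * v / g_norm for v in g]

    # w: Gaussian vector projected onto the orthogonal complement of x2.
    h = [rng.gauss(0.0, 1.0) for _ in range(n)]
    coeff = sum(a * b for a, b in zip(h, x2)) / (n * s2)
    w = [a - coeff * b for a, b in zip(h, x2)]
    w_norm = math.sqrt(sum(v * v for v in w))

    # x1 = (component along x2) + (component of fixed length orthogonal to x2).
    along = rho * math.sqrt(s1 / s2)
    ortho = math.sqrt(n * s1 * (1.0 - rho ** 2))
    x1 = [along * b + ortho * v / w_norm for b, v in zip(x2, w)]
    return x1, x2


if __name__ == "__main__":
    rng = random.Random(0)
    x1, x2 = sample_codeword_pair(50, 1.0, 2.0, 0.3, rng)
    print(sum(a * a for a in x1), sum(b * b for b in x2), sum(a * b for a, b in zip(x1, x2)))
```

Up to floating-point error, every sampled pair satisfies the three equalities defining $\mathcal{D}_n(\rho)$ in (8.54).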


Step 2: (Evaluation of the Non-Asymptotic Achievability Bound in Proposition 8.1) We now need to
identify typical sets of $(\mathbf{x}_2,\mathbf{y})$ and $\mathbf{y}$ such that the maximum values of the ratios of the densities $\zeta_1$ and $\zeta_{12}$,
defined in (8.8), are uniformly bounded on these sets. For this purpose, we leverage the following lemma.

Lemma 8.2. Consider the setup of Proposition 8.1, where the output distributions are chosen to be $Q_{Y^n|X_2^n} :=$
$(P_{X_1|X_2}W)^n$ and $Q_{Y^n} := (P_{X_1X_2}W)^n$ with $P_{X_1X_2} := \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}(\rho))$, and the input joint distribution $P_{X_1^nX_2^n}$ is
described by (8.56)-(8.57). There exist sets $\mathcal{A}_1 \subset \mathcal{X}_2^n \times \mathcal{Y}^n$ and $\mathcal{A}_{12} \subset \mathcal{Y}^n$ (depending on $n$ and $\rho$) such that
the following

$$ \max_{\rho\in[0,1]} \max\{\zeta_1, \zeta_{12}\} \le \Lambda, \tag{8.58} $$
$$ \max_{\rho\in[0,1]} \max\big\{ \Pr\big((X_2^n,Y^n) \notin \mathcal{A}_1\big),\ \Pr\big(Y^n \notin \mathcal{A}_{12}\big) \big\} \le \exp(-n\xi), \tag{8.59} $$

hold for all $n > n_0$, where $\Lambda < \infty$, $\xi > 0$ and $n_0 \in \mathbb{N}$ are constants not depending on $\rho$.


The proof of this technical lemma is omitted and can be found in [138]. It extends and refines ideas in
Polyanskiy-Poor-Verdú’s proof of the dispersion of AWGN channels [123, Thm. 54 & Lem. 61].

Note that the uniformity of the bounds $\Lambda$ and $\exp(-n\xi)$ in (8.58)-(8.59) in $\rho$ is crucial for handling $\rho$
varying with $n$, as is required in Lemma 8.1.

Equipped with Lemma 8.2, we now apply the multivariate Berry-Esseen theorem (Corollary 1.1) to
estimate the probability in the non-asymptotic achievability bound in Proposition 8.1. This computation is
similar to that sketched in the converse proof with $\xi_1 = \xi_2 = 0$. This concludes the achievability proof of
Lemma 8.1. □

8.3.2 Proof Sketch of the Local Result (Theorem 8.1)

Proof. We begin with the converse proof. We only prove the result in Case (ii), because Case (i) is standard
(it follows from the single-user case in Theorem 4.4) and Case (iii) is similar to Case (ii).

Step 1: (Passage to a Convergent Subsequence) Fix a correlation coefficient $\rho \in (0,1]$, and consider any sequence
of $(n, M_{1n}, M_{2n}, S_1, S_2, \varepsilon_n)$-codes satisfying (8.10). Let us consider the associated rates $\{(R_{1n}, R_{2n})\}_{n\in\mathbb{N}}$.
As required by the definition of second-order rate pairs $(L_1,L_2) \in \mathcal{L}(\varepsilon; R_1^*, R_2^*)$, these codes must satisfy

$$ \liminf_{n\to\infty} R_{jn} \ge R_j^*, \tag{8.60} $$
$$ \liminf_{n\to\infty} \sqrt{n}\,(R_{jn} - R_j^*) \ge L_j, \quad j = 1,2, \tag{8.61} $$
$$ \limsup_{n\to\infty} \varepsilon_n \le \varepsilon \tag{8.62} $$

for some $(R_1^*, R_2^*)$ on the boundary parametrized by $\rho$, i.e., $R_1^* = I_1(\rho)$ and $R_1^* + R_2^* = I_{12}(\rho)$. The first-order
optimality condition in (8.60) is not explicitly required by (8.10), but it is implied by (8.61). Letting
$\mathbf{R}_n := [R_{1n}, R_{1n} + R_{2n}]^T$ be the non-asymptotic rate vector, we have, from the global converse bound in (8.27),
that there exists a (possibly non-unique) sequence $\{\rho_n\}_{n\in\mathbb{N}} \subset [-1,1]$ such that

$$ \mathbf{R}_n \in \mathbf{I}(\rho_n) + \frac{\Psi^{-1}(\mathbf{V}(\rho_n),\varepsilon)}{\sqrt{n}} + \overline{g}(\rho_n,\varepsilon,n)\mathbf{1}. \tag{8.63} $$

Since we used the lim inf for the rates and lim sup for the error probability in the conditions in (8.60)-(8.62),
we may pass to a convergent (but otherwise arbitrary) subsequence of $\{\rho_n\}_{n\in\mathbb{N}}$, say indexed by $\{n_l\}_{l\in\mathbb{N}}$.
Recalling that the lim inf (resp. lim sup) is the infimum (resp. supremum) of all subsequential limits, any
converse result associated with this subsequence also applies to the original sequence. Note that at least one
convergent subsequence is guaranteed to exist, since $[-1,1]$ is compact.

For the sake of clarity, we avoid explicitly writing the subscript $l$. However, it should be understood that
asymptotic notations such as $O(\cdot)$ and $(\cdot)_n \to (\cdot)$ are taken with respect to the convergent subsequence.

Step 2: (Establishing The Convergence of $\rho_n$ to $\rho$) Although $\overline{g}(\rho_n,\varepsilon,n)$ in (8.63) depends on $\rho_n$, we know
from the global bounds on the $(n,\varepsilon)$-capacity region (Lemma 8.1) that it is $o(\frac{1}{\sqrt{n}})$ for both $\rho_n \to \pm 1$ and
$\rho_n \to \rho \in (-1,1)$. Hence,

$$ \mathbf{R}_n \in \mathbf{I}(\rho_n) + \frac{\Psi^{-1}(\mathbf{V}(\rho_n),\varepsilon)}{\sqrt{n}} + o\Big(\frac{1}{\sqrt{n}}\Big)\mathbf{1}. \tag{8.64} $$

We claim that the result in (8.64) allows us to conclude that every sequence $\{\rho_n\}_{n\in\mathbb{N}}$ that serves to
parametrize an outer bound of the non-asymptotic rates in (8.63) converges to $\rho$. Indeed, since the boundary
of the capacity region is curved and uniquely parametrized by $\rho$ for $\rho \in (0,1]$, $\rho_n \not\to \rho$ implies for some
$\eta > 0$ and for all sufficiently large $n$ that either $I_1(\rho_n) \le I_1(\rho) - \eta$ or $I_{12}(\rho_n) \le I_{12}(\rho) - \eta$. Combining this
with (8.64), we deduce that

$$ R_{1n} \le I_1(\rho) - \frac{\eta}{2} \quad\text{or}\quad R_{1n} + R_{2n} \le I_{12}(\rho) - \frac{\eta}{2} \tag{8.65} $$

for sufficiently large $n$. This, in turn, contradicts the convergence of $(R_{1n}, R_{2n})$ to $(R_1^*, R_2^*)$ implied by (8.10).

Step 3: (Establishing The Convergence Rate of $\rho_n$ to $\rho$) Because each entry of $\mathbf{I}(\rho)$ is twice continuously
differentiable, a Taylor expansion yields

$$ \mathbf{I}(\rho_n) = \mathbf{I}(\rho) + \mathbf{D}(\rho)(\rho_n - \rho) + O\big((\rho_n - \rho)^2\big)\mathbf{1}. \tag{8.66} $$

In the case that $\rho_n - \rho = \omega(\frac{1}{\sqrt{n}})$, it is not difficult to show that (8.64) and (8.66) imply

$$ \mathbf{R}_n \le \mathbf{I}(\rho) + \mathbf{D}(\rho)(\rho_n - \rho) + o(\rho_n - \rho)\mathbf{1}. \tag{8.67} $$

Since the first entry of $\mathbf{D}(\rho)$ is negative and the second entry is positive, (8.67) states that $L_1 = +\infty$ (i.e.,
a large addition to $R_1^*$) only if $L_1 + L_2 = -\infty$ (i.e., a large backoff from $R_1^* + R_2^*$), and $L_1 + L_2 = +\infty$
only if $L_1 = -\infty$. This is due to the fact that we only consider second-order deviations from the boundary of
the capacity region of the order $\Theta(\frac{1}{\sqrt{n}})$. Neglecting these degenerate cases as they are already captured by
Theorem 8.1 (cf. Fig. 8.4), in the remainder, we focus on the case where $\rho_n - \rho = O(\frac{1}{\sqrt{n}})$.


Step 4: (Completion of the Proof) Assuming now that $\rho_n - \rho = O(\frac{1}{\sqrt{n}})$, we can use the Bolzano-Weierstrass
theorem to conclude that there exists a (further) subsequence indexed by $\{n_k\}_{k\in\mathbb{N}}$ (say) such
that $\sqrt{n_k}(\rho_{n_k} - \rho) \to \beta$ for some $\beta \in \mathbb{R}$. Then, for the blocklengths indexed by $n_k$, by combining (8.64) and
(8.66), we have

$$ \sqrt{n_k}\,\big(\mathbf{R}_{n_k} - \mathbf{I}(\rho)\big) \in \beta\mathbf{D}(\rho) + \Psi^{-1}(\mathbf{V}(\rho),\varepsilon) + o(1)\mathbf{1}. \tag{8.68} $$

Here we have also used the fact that the set-valued function $\rho \mapsto \Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$ is “continuous” to approximate
$\Psi^{-1}(\mathbf{V}(\rho_{n_k}),\varepsilon)$ with $\Psi^{-1}(\mathbf{V}(\rho),\varepsilon)$. The details of this technical step are omitted, and can be found in [138].
By referring to the second-order optimality condition in (8.61), and applying the definition of the limit
inferior, we know that every convergent subsequence of $\{R_{jn}\}_{n\in\mathbb{N}}$ has a subsequential limit that satisfies
$\lim_{k\to\infty} \sqrt{n_k}(R_{jn_k} - R_j^*) \ge L_j$ for $j = 1,2$. In other words, for all $\gamma > 0$, there exists an integer $K_j$ such
that $\sqrt{n_k}(R_{jn_k} - R_j^*) \ge L_j - \gamma$ for all $k \ge K_j$. Thus, for all $k \ge \max\{K_1, K_2\}$, we may lower bound
each component in the vector on the left of (8.68) with $L_1 - \gamma$ and $L_1 + L_2 - 2\gamma$. There also exists an
integer $K_3$ such that the $o(1)$ terms are upper bounded by $\gamma$ for all $k \ge K_3$. We conclude that any pair of
$(\varepsilon, R_1^*, R_2^*)$-achievable second-order coding rates $(L_1,L_2)$ must satisfy

$$ \begin{bmatrix} L_1 - 2\gamma \\ L_1 + L_2 - 3\gamma \end{bmatrix} \in \bigcup_{\beta\in\mathbb{R}} \Big\{ \beta\mathbf{D}(\rho) + \Psi^{-1}(\mathbf{V}(\rho),\varepsilon) \Big\}. \tag{8.69} $$

Finally, since $\gamma > 0$ is arbitrary, we can take $\gamma \downarrow 0$, thus completing the converse proof for Case (ii).
The achievability part is similar to the converse part, yet simpler. Specifically, we can simply choose

$$ \rho_n := \rho + \frac{\beta}{\sqrt{n}} \tag{8.70} $$

and apply the above arguments based on Taylor expansions. □


8.4 Difficulties in the Fixed Error Analysis for the MAC 

We conclude by discussing the difficulties in performing fixed error probability analysis for
the discrete memoryless or Gaussian MACs (with non-degraded message sets).

First, it is known that the capacity region of the MAC depends on whether one is adopting the average
or maximal error probability criterion. The capacity regions are, in general, different [46]. In Step 2 of the
converse proof, we performed an important reduction from the average to the maximal error probability
criterion. Such a reduction is not valid for the standard MAC, which is one obstacle for any (global or local) converse proof for fixed error analysis of the MAC.

Second, in the characterization of the capacity region of the discrete memoryless MAC, one needs to
involve an auxiliary time-sharing random variable $Q$ [49, Sec. 4.5]. At the time of writing, there does not
appear to be a principled and unified way to introduce such a variable in strong converse proofs (unlike weak
converse proofs [45]).

Finally, for the discrete memoryless MAC, one needs to take the convex closure of the union over input
distributions $P_{X_1|Q}, P_{X_2|Q}$ for a given time-sharing distribution $P_Q$ [49, Sec. 4.5]. In the absence of the degraded
message sets (or asymmetry) assumption, one needs to develop a converse technique, possibly related
to the wringing technique of Ahlswede [3], to assert that the given codeword pairs are almost independent
(or almost orthogonal for the Gaussian case). By leveraging the degraded message sets assumption, we circumvented
this requirement in this chapter, but for the MAC, it is not clear whether the wringing technique
yields a redundancy term that matches the best known inner bound to the second-order region [112], [136].

Chapter 9


Summary, Other Results, Open 
Problems 

9.1 Summary and Other Results 

In this monograph, we compiled a list of conclusive fixed error results in information theory. We began our 
discussion with a review of binary hypothesis testing and used the asymptotic expansions of the $\varepsilon$-information
spectrum divergence and $\varepsilon$-hypothesis testing divergence for product measures to derive similar asymptotic
expansions for the minimum code size in lossless data compression. Lossy data compression and channel 
coding were discussed in detail next. These subjects culminated in our derivation of an asymptotic expansion 
for the source-channel coding rate. We then analyzed various channel models whose behaviors are governed 
by random states. 

In this monograph, we also discussed a small collection of problems in multi-user information theory [49], 
where we were interested in quantifying the optimum speed of rate pairs converging towards a fixed point 
on the boundary of the (first-order) capacity region in the channel coding case, or optimal rate region in 
the source coding case. We saw three examples of problems in network information theory where conclusive 
results can be obtained in the second-order sense. These included the distributed lossless source coding 
(Slepian-Wolf) problem, as well as some special classes of Gaussian multiple-access and interference channels. 

We conclude our treatment of fixed error asymptotics in information theory by mentioning related works 
in the literature. 

9.1.1 Channel Coding 

Early works on fixed error asymptotics in channel coding by Dobrushin [45], Kemperman and Strassen [152]
were discussed in an earlier chapter. The interest in asymptotic expansions was revived in recent years by the works
of Hayashi and Polyanskiy-Poor-Verdú. Before these prominent works came to the fore, Baron-Khojastepour-Baraniuk
[14] considered the rate of convergence to channel capacity for simple channel models
such as the binary symmetric channel.

In this monograph, we did not discuss channels with feedback or variable-length terminations, both
of which are important in practical communication systems. Polyanskiy-Poor-Verdú [125] studied various
incremental redundancy schemes and derived several asymptotic expansions. Generally, the $O(\sqrt{n})$ dispersion
term is not present, showing that channels with feedback perform much better than those without
feedback, an observation that is also corroborated by a more traditional error exponent analysis [211173].
Williamson-Chen-Wesel [179] showed that their proposed reliability-based decoding schemes for variable-length
coding with feedback can achieve higher rates than [125]. Altuğ-Wagner [5] showed that full output
feedback improves the second-order term in the asymptotic expansion for channel coding if $V_{\min} < V_{\max}$.
Tan-Moulin [158] considered the second-order asymptotics of erasure and list decoding. This analysis is the
fixed error probability analogue of Forney’s analysis of erasure and list decoding from the error exponents
perspective. Erasure decoding is intimately connected to decision feedback or automatic retransmission
request (ARQ) schemes as the declaration of an erasure event at the decoder can inform the encoder to
resend the erased information bits.

Shkel-Tan-Draper [147], [148] considered the unequal error protection of message classes and related the
asymptotic expansions for this problem to lossless joint source-channel coding [149]. Moulin [113] studied
the asymptotics for the channel coding problem up to the fourth-order term using strong large deviation
techniques [43, Thm. 3.7.4]. Matthews [108] made an interesting observation concerning the relation of
the non-asymptotic channel coding converse (Proposition 4.4) to so-called non-signaling codes in quantum
information. He demonstrated efficient linear programming-based algorithms to evaluate the converse for
DMCs.

Other (rather more unconventional) works on fixed error asymptotics for point-to-point communication
include Riedl-Coleman-Singer’s analysis of queuing channels, Polyanskiy-Poor-Verdú’s analysis of the
minimum energy for sending $k$ bits for Gaussian channels with and without feedback [126], and Ingber-Zamir-Feder’s
analysis of the infinite constellations problem [88].


9.1.2 Random Number Generation, Intrinsic Randomness and Channel Resolv¬ 
ability 

The problem of intrinsic randomness is to approximate an arbitrary source with uniform bits while random
number generation is the dual, i.e., that of generating uniform bits from a given source [57, Ch. 2] [7T]. These
problems were treated from the fixed approximation error (in terms of the variational distance) perspective
by Hayashi and Nomura-Han. An interesting observation made by Hayashi is that the
folklore theorem† posed by Han [68] does not hold for the variational distance criterion. This is interesting,
because the first-order fundamental limit for source coding and intrinsic randomness is the same, i.e., the
entropy rate [57, Ch. 2] (at least for sources that satisfy the Shannon-McMillan-Breiman theorem). Thus,
the violation of the folklore theorem for variational distance appears to be distillable only from the study
of second- and not first-order asymptotics, demonstrating additional insight one can glean from studying
higher-order terms in asymptotic expansions.

The channel resolvability problem consists in approximating the output statistics of an arbitrary channel
given uniform bits at the input [57, Ch. 6] [7T]. It is particularly useful for the strong converse of the
identification problem. Watanabe and Hayashi [176] considered the channel resolvability problem, proving
a second-order coding theorem under an “information radius” condition not dissimilar to what is known for
channel coding [55, Thm. 4.5.1].


9.1.3 Channels with State 


For channels with random state, Watanabe-Kuzuoka-Tan [177] and Yassaee-Aref-Gohari [188] derived the 
best non-asymptotic bounds for the Gel’fand-Pinsker problem, improving on those by Verdú [168]. With 
these bounds, one can easily derive achievable second-order coding rates by appealing to various 
Berry-Esseen theorems. The technique in [177] is based on channel resolvability and channel simulation, 
while that in [188] is based on an elegant coding scheme known as the stochastic likelihood decoder (also 
known as the “pretty good measurement” in quantum information), which is also applicable to other 
multi-terminal problems such as multiple-description coding and the Berger-Tung problem [49]. Scarlett [134] also 
considered the second-order asymptotics for the discrete memoryless Gel’fand-Pinsker problem and used 
ideas in Section 5.2 to evaluate the best known achievable second-order coding rates based on constant 
composition codes. 
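
As a reminder of the tool invoked in the paragraph above, one common scalar i.i.d. form of the Berry-Esseen theorem is sketched below. The symbols X_i, μ, σ², T and the unspecified universal constant c are notation assumed here purely for illustration; the works cited above may rely on multivariate or non-identically distributed variants rather than this exact statement. 

```latex
% Berry-Esseen theorem, scalar i.i.d. form (notation assumed for illustration).
% X_1, ..., X_n are i.i.d. with mean \mu, variance \sigma^2 > 0 and
% finite third absolute central moment T = E|X_1 - \mu|^3.
\[
  \sup_{t \in \mathbb{R}}
  \left| \Pr\!\left( \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu) \le t \right) - \Phi(t) \right|
  \le \frac{c\,T}{\sigma^{3}\sqrt{n}},
\]
% where \Phi is the standard Gaussian cdf and c > 0 is a universal constant.
```

When the X_i are taken to be per-letter information densities, the O(1/√n) error in this Gaussian approximation is what allows a non-asymptotic bound to be converted into a statement about the second-order term. 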

Polyanskiy derived the second-order asymptotics for the compound channel, where the channel 
state is non-random, in contrast to the models studied earlier in this monograph. Similar to the Gaussian MAC with 
degraded message sets, the second-order term depends on the variance of the channel information density 
and the derivatives of the mutual informations. Finally, Hoydis et al. considered block-fading MIMO 
channels. In contrast to Section 5.5, here the channel matrix is not quasi-static and so the analysis is 
somewhat more involved and requires the use of random matrix theory. 

¹The folklore theorem of Han [68] states that “the output from any source encoder working at the optimal coding rate with 
asymptotically vanishing probability of error looks almost completely random.” 


9.1.4 Multi-Terminal Information Theory 


The advances in the second-order asymptotics for multi-terminal problems have been modest. Early works 
include those by Sarvotham-Baron-Baraniuk [131, 132] and He et al. for the single-encoder Slepian-Wolf 
problem. However, unlike our treatment of Slepian-Wolf coding earlier in this monograph, there is only one 
source to be compressed, and full side-information is available at the decoder. 

Other authors also considered inner bounds to the (n, ε)-rate regions (also called global 
achievability regions) for the discrete memoryless and Gaussian MACs, but it appears that conclusive results 
are much harder to derive without any further assumptions on the channel model. These are multi-terminal 
channel coding analogues of the corresponding discussion for Slepian-Wolf coding in Section 6.4.3. See further 
discussions in Section 9.2.3. 


9.1.5 Moderate Deviations, Exact Asymptotics and Saddlepoint Approximations 

The study of second-order coding rates is intimately related to moderate deviations analysis. In the former, 
the error probability is bounded above by a non-zero constant and optimal rates converge to the first-order 
fundamental limit with speed O(1/√n). In the latter, the error probability decays to zero sub-exponentially 
and the optimal rates converge to the first-order fundamental limit slower than O(1/√n). The dispersion 
also appears in the solution of the moderate deviations analysis because the second derivative of the error 
exponent (reliability function) is inversely proportional to the dispersion. The study of moderate deviations 
in information theory started with the work by Altug-Wagner [5] and Polyanskiy-Verdú [127] on channel 
coding. Sason [133], Tan [154] and Tan-Watanabe-Hayashi [160] considered the binary hypothesis testing, 
rate-distortion and lossless joint source-channel coding counterparts respectively. 
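
The contrast between the two regimes can be sketched for channel coding as follows. This is a summary in assumed notation, not a statement quoted from this section: M*(n, ε) denotes the largest codebook size at blocklength n and error probability ε, C the capacity, V the dispersion (assumed positive), Φ the standard Gaussian cdf, ρ_n the backoff from capacity, and ε*_n the smallest error probability at rate C − ρ_n. The displays are meant for discrete memoryless channels under suitable regularity conditions. 

```latex
% Second-order (fixed-error) regime: the error probability is held at \varepsilon \in (0,1).
\[
  \log M^*(n,\varepsilon) = nC + \sqrt{nV}\,\Phi^{-1}(\varepsilon) + O(\log n).
\]
% Moderate deviations regime: rates R_n = C - \rho_n with
% \rho_n \to 0 and n\rho_n^2 \to \infty, so the backoff is slower than 1/\sqrt{n}.
\[
  \lim_{n\to\infty} -\frac{1}{n\rho_n^{2}} \log \varepsilon_n^* = \frac{1}{2V}.
\]
% Link to the error exponent E(R) near capacity (one-sided second derivative at C):
\[
  E(R) = \frac{(C-R)^2}{2V} + o\big((C-R)^2\big) \quad \text{as } R \uparrow C,
  \qquad\text{so that}\qquad E''(C) = \frac{1}{V}.
\]
```

Read this way, the same quantity V governs the √n backoff at fixed error and the quadratic behavior of the exponent close to capacity, which is the relation described in the paragraph above. 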

In Section 4.4, we mentioned efforts from Altug-Wagner and Scarlett-Martinez-Guillén i Fàbregas 
[135] in deriving the exact asymptotics for channel coding. The authors were motivated to find the prefactors 
in the error exponents regime for various classes of DMCs. Scarlett-Martinez-Guillén i Fàbregas [137] recently 
demonstrated that results concerning second-order coding rates, moderate deviations, large deviations, and 
even exact asymptotics may be unified through the use of so-called saddlepoint approximations. 


9.2 Open Problems and Challenges Ahead 

Clearly, there are many avenues of further research, some of which we mention here. We also highlight some 
challenges we foresee. 


9.2.1 Universal Codes 


In Section 3.3 we analyzed a partially universal source code that achieves the source dispersion (varentropy). 
The source code only requires the knowledge of the entropy and the varentropy. The channel dispersion can 
also be achieved using partially universal channel codes as discussed in the paragraph above (4.56). However, 
the third-order terms are much more difficult to quantify. It would be interesting to pursue research in the 
third-order asymptotics of source and channel coding for fully universal codes to understand the loss of 
performance due to universality. Work along these lines for fixed-to-variable length lossless source coding 
has been carried out by Kosut and Sankar. 
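
To make concrete why knowledge of only the entropy and the varentropy can suffice to second order, the following minimal sketch computes both quantities for a finite memoryless source and evaluates the usual Gaussian (normal) approximation to the minimum fixed-length coding rate. The function names and the rule H + √(V/n) Φ⁻¹(1 − ε) are illustrative assumptions; this is not the code construction of Section 3.3, and third-order terms are ignored. 

```python
import math
from statistics import NormalDist


def entropy_and_varentropy(pmf):
    """Return (H, V): entropy in bits and varentropy in bits^2 of a finite pmf."""
    if any(p < 0 for p in pmf) or abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("pmf must be non-negative and sum to 1")
    support = [p for p in pmf if p > 0]
    # Entropy is the mean of the self-information -log2 p(X).
    h = -sum(p * math.log2(p) for p in support)
    # Varentropy is the variance of the self-information.
    v = sum(p * (-math.log2(p) - h) ** 2 for p in support)
    return h, v


def normal_approx_rate(pmf, n, eps):
    """Gaussian approximation (bits/symbol) to the minimum fixed-length
    lossless coding rate at blocklength n and error probability eps.
    Hypothetical helper: keeps only the first- and second-order terms."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    h, v = entropy_and_varentropy(pmf)
    return h + math.sqrt(v / n) * NormalDist().inv_cdf(1.0 - eps)


if __name__ == "__main__":
    source = [0.25, 0.75]  # example source, chosen only for illustration
    for n in (100, 1000, 10000):
        print(n, round(normal_approx_rate(source, n, eps=0.1), 4))
```

The approximation depends on the source only through (H, V), which is the sense in which such a code is partially universal; a fully universal code would have to estimate these quantities from the data, and the resulting cost is expected to surface in the higher-order terms. 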


9.2.2 Side-Information Problems 


Watanabe-Kuzuoka-Tan [177] and Yassaee-Aref-Gohari [188] derived the best known achievability bounds for 
side-information problems including the Wyner-Ahlswede-Körner (WAK) problem [7, 182] and the 
Wyner-Ziv [184] problem. However, non-asymptotic converses are difficult to derive for such problems which involve 
auxiliary random variables. Even when they can be derived, the evaluation of such converses asymptotically 
appears to be formidable. 

Because a second-order converse implies the strong converse, it is useful to first understand the 
techniques involved in obtaining a strong converse. To the best of the author’s knowledge, there are only three 
approaches that may be used to obtain strong converses for network problems whose first-order (capacity) 
characterization involves auxiliary random variables. The first is the information spectrum method [67]. For 
example, Boucheron and Salamatian (in Lem. 2 of their work) provide a non-asymptotic converse bound for the 
asymmetric broadcast channel. However, the bound is neither computable nor amenable to good approximations 
for large or even moderate blocklengths n as one has to perform an exhaustive search over the space of all 
n-letter auxiliary random variables. The second is the entropy and image size characterization technique 
[5] based on the blowing-up lemma [5, 107]. (Also see the monograph [129] for a thorough description of 
this technique.) This has been used to prove the strong converse for the WAK problem [5], the asymmetric 
broadcast channel [5], the Gel’fand-Pinsker problem and the Gray-Wyner problem. However, the 
use of the blowing-up approach to obtain second-order converse bounds is not straightforward. The third 
method involves a change-of-measure argument, and was used in the work of Kelly and Wagner (Thm. 2 of their work) 
to prove an upper bound on the error exponent for WAK coding. Again, it does not appear, at first glance, 
that this argument is amenable to second-order analysis. 

A problem similar to side-information problems such as Gel’fand-Pinsker, Wyner-Ziv and WAK is the 
multi-terminal statistical inference problem studied by Han and Amari [69] among others. Asymptotic 
expansions with non-vanishing type-II error probability may be derivable under some settings (using 
established techniques), if the first-order characterization is conclusively known, and there are no auxiliary 
random variables, e.g., the problem of multiterminal detection with zero-rate compression [139]. 


9.2.3 Multi-Terminal Information Theory 


The study of second-order asymptotics for multi-terminal problems is in its infancy and the problems 
described in this monograph form only the tip of a large iceberg. The primary difficulty is our inability to deal, 
in a systematic and principled way, with auxiliary random variables for the (strong) converse part. Thus, 
genuinely new non-asymptotic converses need to be developed, and these converses have to be amenable 
to asymptotic evaluations in the presence of auxiliary random variables. As an example, for the degraded 
broadcast channel, the usual non-intuitive identification of the auxiliary random variable by Gallager 
for proving the weak converse does not suffice as the strong converse is implied by a 
second-order converse. Other possible techniques, such as information spectrum analysis [20] or the 
blowing-up lemma, were highlighted in the previous section. Their limitations were also discussed. For the discrete 
memoryless MAC, a strong converse was proved by Ahlswede [3] but his wringing technique does not seem 
to be amenable to second-order refinements as discussed in Chapter 8. 

In contrast to the single-user setting, constant composition codes may be beneficial even in the absence 
of cost constraints for discrete memoryless multi-user problems. This is because there does not exist an 
analogue of the relation in (4.24), where the unconditional information variance is equal to the conditional information 
variance for all CAIDs. Scarlett-Martinez-Guillén i Fàbregas [136] provided the best known inner bounds 
to the (n, ε)-rate region for the discrete memoryless MAC. Tan-Kosut [157] showed that conditionally 
constant composition codes also outperform i.i.d. codes for the asymmetric broadcast channel when the error 
probability is non-vanishing. It would be fruitful to continue pursuing research in the direction of constant 
composition codes for multi-user problems (e.g., the interference channel) to exploit their full potential. 


9.2.4 Information-Theoretic Security 

Finally, we mention that within the realm of information-theoretic security, there are several 
partial results in the fixed error and leakage setting. Yassaee-Aref-Gohari [187, Thm. 4] used a general 
random binning procedure, called output statistics of random binning, to derive a second-order achievability 
bound for the wiretap channel [183], improving on earlier work by Tan [153]. The constraints pertain to the 
error probability of the legitimate receiver in decoding the message and the leakage rate to the eavesdropper 
measured in terms of the variational distance. However, in [187], there were no converse results even for the 
less noisy (or even degraded) case where there are no auxiliary random variables. 

The most conclusive work thus far in information-theoretic security pertains to the secret key agreement 
model [1], where the second-order asymptotics were derived by Hayashi-Tyagi-Watanabe. Interestingly, 
the non-asymptotic converse bound relates the size of the key to the ε-hypothesis testing divergence, similar 
to some point-to-point problems as discussed in this monograph. The non-asymptotic direct bound is derived 
based on the information spectrum slicing technique. The author believes that the 
fixed error and fixed leakage analysis for the wiretap channel, leveraging the secret key result, may lead to 
new insights into the design of secure communication systems at the physical layer. For converse theorems, 
the development of novel strong converse techniques for the wiretap channel appears to be necessary; there 
are recent results on this for degraded wiretap channels using the information spectrum method [155] and 
active hypothesis testing [79]. 
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